arxiv: 2511.11232 · v2 · submitted 2025-11-14 · 💻 cs.CV

DoReMi: Bridging 3D Domains via Topology-Aware Domain-Representation Mixture of Experts

Mingwei Xing , Xinliang Wang , Yifeng Shi This is my paper

Pith reviewed 2026-05-17 22:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D scene understandingMixture of Expertsdomain adaptationtopology-aware routingsemantic segmentationself-supervised pre-trainingcross-domain generalizationpoint cloud processing

0 comments

The pith

DoReMi combines self-supervised structural pre-training with dynamic topology-guided expert routing to build a single model that handles 3D data from different sensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that Mixture-of-Experts networks for 3D scene understanding can overcome routing bias when data shares semantics but differs in topology due to sensor variations. It introduces a self-supervised pre-training branch that learns from topological and texture changes to fix cross-domain structural priors in place. A second branch then uses spatial context to guide routing and entropy measures to decide how many experts activate at each step. Together these allow one set of weights to extract general features while adapting locally to topology shifts. The result is reported gains on indoor segmentation benchmarks that previously required separate models per domain.

Core claim

Through the synergy of these dual branches, DoReMi achieves a deep integration of universal feature extraction and highly adaptive expert allocation. It achieves 80.1% mIoU on the ScanNet validation set and 77.2% mIoU on S3DIS, comprehensively outperforming existing state-of-the-art methods.

What carries the argument

Dual-branch architecture: a self-supervised multi-attribute pre-training branch that anchors topological and texture priors, paired with a domain-aware expert branch using Domain Spatial-Guided Routing to sense local topology and Entropy-controlled Dynamic Allocation to adjust active experts by routing uncertainty.

If this is right

A single set of weights can now process both indoor and outdoor 3D scenes without separate domain-specific fine-tuning.
Expert routing decisions become sensitive to local spatial structure rather than semantics alone, reducing misallocation on heterogeneous inputs.
Dynamic control of expert activation via routing entropy keeps training stable while still allowing adaptation to varying topology.
The same dual-branch pattern yields measurable lifts on semantic segmentation across multiple public 3D benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structural-prior anchoring could be tested on 3D tasks beyond segmentation, such as detection or reconstruction, where topology also varies by sensor.
MoE designs in other fields that face heterogeneous inputs might reduce routing bias by adding an early self-supervised branch focused on domain-specific structure.
If the entropy-controlled allocation proves robust, it offers a practical knob for trading compute against accuracy on edge devices that ingest mixed 3D streams.

Load-bearing premise

That self-supervised training on topological and texture variations can anchor cross-domain structural priors strongly enough to override semantics-driven routing bias in existing 3D MoE networks.

What would settle it

Train the model on a fresh collection of 3D scenes that keep semantic labels consistent but introduce new topological shifts from unseen sensor setups, then measure whether ablating the pre-training branch or the spatial-routing and entropy mechanisms eliminates the reported gains over standard MoE baselines.

Figures

Figures reproduced from arXiv: 2511.11232 by Mingwei Xing, Xinliang Wang, Yifeng Shi.

**Figure 1.** Figure 1: Performance of DoReMi. DoReMi advances multiple state-of-the-art benchmarks on 3D scene understanding datasets. verse 3D datasets [2, 8, 55], 3D deep learning has achieved remarkable progress in tasks such as semantic segmentation, object detection, and other 3D understanding applications [6, 23, 48]. However, compared with 2D vision tasks, the 3D vision field still faces the dual challenges of limited… view at source ↗

**Figure 2.** Figure 2: The architecture of DoReMi. It combines two branches: a domain-aware branch (Do) and a unified-representation branch (Re). DoReMi starts with a pretrained model generated through self-supervised learning on multi-attribute data. Re freezes this pretrained knowledge to extract cross-domain geometric and structural patterns. Meanwhile, Do uses Domain-Guided Spatial Routing (DSR) and Entropy-Controlled Dynami… view at source ↗

**Figure 3.** Figure 3: Qualitative analysis. Visualization of different methods on ScanNet [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Expert utilization rates across different datasets. Taking Encoder Stage 2 as an example, y-axis values represent dataset-specific normalized utilization. More distribution results can be found in the supplementary materials. mentary material. In S3DIS, Experts 0 and 4 are activated more frequently, suggesting they specialize to the dataset’s spatial and semantic patterns. In Structured3D, activations are … view at source ↗

read the original abstract

Constructing a unified 3D scene understanding model has long been hindered by the significant topological discrepancies across different sensor modalities. While applying the Mixture-of-Experts (MoE) architecture is an effective approach to achieving universal understanding, we observe that existing 3D MoE networks often suffer from semantics-driven routing bias. This makes it challenging to address cross-domain data characterized by "semantic consistency yet topological heterogeneity." To overcome this challenge, we propose DoReMi (Topology-Aware Domain-Representation Mixture of Experts). Specifically, we introduce a self-supervised pre-training branch based on multi attributes, such as topological and texture variations, to anchor cross-domain structural priors. Building upon this, we design a domain-aware expert branch comprising two core mechanisms: Domain Spatial-Guided Routing (DSR), which achieves an acute perception of local topological variations by extracting spatial contexts, and Entropy-controlled Dynamic Allocation (EDA), which dynamically adjusts the number of activated experts by quantifying routing uncertainty to ensure training stability. Through the synergy of these dual branches, DoReMi achieves a deep integration of universal feature extraction and highly adaptive expert allocation. Extensive experiments across various tasks, encompassing both indoor and outdoor scenes, validate the superiority of DoReMi. It achieves 80.1% mIoU on the ScanNet validation set and 77.2% mIoU on S3DIS, comprehensively outperforming existing state-of-the-art methods. The code will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DoReMi adds targeted routing fixes to 3D MoE for cross-sensor data, with reported gains on ScanNet and S3DIS that look plausible but rest on unexamined ablations.

read the letter

The main point is that this paper takes standard MoE for 3D scenes and adds a self-supervised pre-training branch plus two routing tweaks to handle cases where data shares semantics but differs in topology across sensors. The new pieces are Domain Spatial-Guided Routing, which pulls spatial context to detect local topological shifts, and Entropy-controlled Dynamic Allocation, which changes the number of active experts according to routing uncertainty. These sit on top of a multi-attribute pre-training step meant to anchor structural priors. That combination is the actual novelty here, and it directly targets the routing bias the authors flag in prior 3D MoE work. The reported results—80.1% mIoU on ScanNet validation and 77.2% on S3DIS—show it beating existing methods on both indoor and outdoor tasks, which is the strongest evidence they offer. The framing of the problem is clear and the mechanisms are described without obvious internal contradictions. The soft spots are in the experimental support. The abstract gives headline numbers but leaves out ablation details, baseline choices, and variance, so it is still unclear how much the new routing components actually move the needle versus other factors. The assumption that semantics-driven bias is the dominant issue also needs tighter validation in the full experiments. This work is aimed at people building unified 3D models for robotics or spatial AI who already use MoE-style architectures. A reader focused on practical domain adaptation in vision would find the routing ideas worth discussing. It is coherent enough on its own terms to deserve a serious referee, even if revisions will be needed to strengthen the empirical case.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DoReMi, a topology-aware Mixture-of-Experts architecture for unified 3D scene understanding across sensor modalities with semantic consistency but topological heterogeneity. It proposes a dual-branch design consisting of a self-supervised multi-attribute pre-training branch to anchor cross-domain structural priors and a domain-aware expert branch that incorporates Domain Spatial-Guided Routing (DSR) for local topological perception via spatial contexts and Entropy-controlled Dynamic Allocation (EDA) for uncertainty-aware expert activation. The authors report state-of-the-art results of 80.1% mIoU on the ScanNet validation set and 77.2% mIoU on S3DIS, with code release planned.

Significance. If the reported gains hold under rigorous validation, the work offers a practical advance toward universal 3D models by explicitly targeting semantics-driven routing bias in MoE networks. The integration of self-supervised structural priors with adaptive routing mechanisms is a coherent response to cross-domain topological discrepancies, and the explicit plan to release code is a clear strength for reproducibility and follow-on research.

major comments (2)

[§4.1 and Table 2] §4.1 and Table 2: the central claim of comprehensive outperformance relies on the dual-branch synergy, yet the ablation study does not isolate the contribution of the self-supervised pre-training branch versus the DSR/EDA components with statistical significance testing or multiple random seeds; this weakens the attribution of the 80.1% and 77.2% mIoU figures to the proposed mechanisms rather than standard MoE scaling.
[§3.2] §3.2, description of EDA: the entropy-controlled dynamic allocation is presented as ensuring training stability by quantifying routing uncertainty, but the manuscript does not specify the exact threshold or functional form used to adjust the number of activated experts (e.g., whether it is a fixed entropy cutoff or a learned schedule), leaving the mechanism under-specified for reproduction.

minor comments (2)

[Figure 3] Figure 3: the visualization of expert activation patterns would be clearer with an additional panel showing routing entropy values across domains to directly illustrate the EDA mechanism.
[§2] §2: the motivation section would benefit from a brief quantitative comparison (e.g., routing bias statistics) on a small cross-domain subset to ground the claim that existing 3D MoE networks suffer primarily from semantics-driven bias.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment point by point below.

read point-by-point responses

Referee: [§4.1 and Table 2] §4.1 and Table 2: the central claim of comprehensive outperformance relies on the dual-branch synergy, yet the ablation study does not isolate the contribution of the self-supervised pre-training branch versus the DSR/EDA components with statistical significance testing or multiple random seeds; this weakens the attribution of the 80.1% and 77.2% mIoU figures to the proposed mechanisms rather than standard MoE scaling.

Authors: We thank the referee for this observation. Our ablation studies in §4.1 and Table 2 progressively incorporate the self-supervised pre-training branch followed by the DSR and EDA components to demonstrate their individual and combined contributions. However, we agree that the absence of multiple random seeds and statistical significance testing limits the strength of attribution. In the revised manuscript we will report mean and standard deviation over at least three independent runs with different seeds and include paired statistical tests (e.g., t-tests) against the baseline MoE variants to better isolate the effect of the proposed mechanisms. revision: yes
Referee: [§3.2] §3.2, description of EDA: the entropy-controlled dynamic allocation is presented as ensuring training stability by quantifying routing uncertainty, but the manuscript does not specify the exact threshold or functional form used to adjust the number of activated experts (e.g., whether it is a fixed entropy cutoff or a learned schedule), leaving the mechanism under-specified for reproduction.

Authors: We agree that the current description of EDA in §3.2 lacks the precise implementation details required for reproduction. We will revise this section to explicitly state the functional form, including the entropy threshold, the rule for increasing the number of activated experts, and whether the schedule is fixed or adaptive, together with the corresponding pseudocode or equations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; novel components on standard MoE base

full rationale

The derivation introduces a self-supervised multi-attribute pre-training branch to anchor structural priors and a domain-aware expert branch with DSR (spatial-guided routing) and EDA (entropy-controlled allocation). These are presented as new mechanisms addressing observed semantics-driven routing bias in existing 3D MoE networks. Performance numbers (80.1% mIoU ScanNet, 77.2% S3DIS) are empirical outcomes, not reductions by construction. No equations or self-citations in the provided text reduce the central claims to fitted inputs, self-definitions, or prior author uniqueness theorems. Minor self-citation risk at most, but central argument remains independent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review is limited to the abstract; no explicit free parameters, background axioms, or independently evidenced new entities are detailed in the provided text.

invented entities (2)

Domain Spatial-Guided Routing (DSR) no independent evidence
purpose: Achieves acute perception of local topological variations by extracting spatial contexts
Core mechanism introduced in the domain-aware expert branch
Entropy-controlled Dynamic Allocation (EDA) no independent evidence
purpose: Dynamically adjusts the number of activated experts by quantifying routing uncertainty to ensure training stability
Core mechanism introduced in the domain-aware expert branch

pith-pipeline@v0.9.0 · 5569 in / 1290 out tokens · 56932 ms · 2026-05-17T22:09:08.345733+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we introduce a self-supervised pre-training branch based on multi attributes, such as topological and texture variations, to anchor cross-domain structural priors... Domain Spatial-Guided Routing (DSR), which achieves an acute perception of local topological variations by extracting spatial contexts
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Entropy-controlled Dynamic Allocation (EDA), which dynamically adjusts the number of activated experts by quantifying routing uncertainty

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling
cs.CV 2026-04 unverdicted novelty 5.0

PLOVIS generates pseudo-labels for 3D point clouds by rendering them into 2D images and applying open-vocabulary segmentation, then filters the labels and uses a class-balanced memory bank to train effective models wi...

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

3d semantic parsing of large-scale indoor spaces

Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InCVPR, 2016. 5

work page 2016
[2]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 1, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InCVPR, 2020. 6

work page 2020
[4]

Unsupervised learn- ing of visual features by contrasting cluster assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Pi- otr Bojanowski, and Armand Joulin. Unsupervised learn- ing of visual features by contrasting cluster assignments. NeurIPS, 2020. 4

work page 2020
[5]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

Ll3da: Visual interactive instruction tuning for omni-3d understand- ing reasoning and planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understand- ing reasoning and planning. InCVPR, 2024. 1

work page 2024
[7]

4d spatio-temporal convnets: Minkowski convolutional neural networks

Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. InCVPR, 2019. 5

work page 2019
[8]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 1, 5

work page 2017
[9]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 1

work page 2009
[10]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 1

work page internal anchor Pith review Pith/arXiv arXiv 2010
[11]

Switch transformers: Scaling to trillion parameter models with sim- ple and efficient sparsity.JMLR, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with sim- ple and efficient sparsity.JMLR, 2022. 2, 4

work page 2022
[12]

3d-front: 3d furnished rooms with layouts and semantics

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Bin- qiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. InICCV, 2021. 5

work page 2021
[13]

Vimoe: An empirical study of designing vision mixture-of-experts.arXiv preprint arXiv:2410.15732, 2024

Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, and Qi Tian. Vimoe: An empirical study of designing vision mixture-of-experts.arXiv preprint arXiv:2410.15732, 2024. 2

work page arXiv 2024
[14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR,

work page
[15]

Pointinst3d: Segmenting 3d instances by points

Tong He, Wei Yin, Chunhua Shen, and Anton Van den Hen- gel. Pointinst3d: Segmenting 3d instances by points. In ECCV, 2022. 2

work page 2022
[16]

Exploring data-efficient 3d scene understanding with contrastive scene contexts

Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts. InCVPR, 2021. 5

work page 2021
[17]

Adaptive mixtures of local experts.Neu- ral Comput., 1991

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts.Neu- ral Comput., 1991. 2

work page 1991
[18]

Pointgroup: Dual-set point grouping for 3d instance segmentation

Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi- Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. InCVPR, 2020. 2

work page 2020
[19]

Pointpillars: Fast encoders for object detection from point clouds

Alex H Lang, Sourabh V ora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. InCVPR, pages 12697–12705, 2019. 2

work page 2019
[20]

Pamba: Enhancing global inter- action in point clouds via state space model

Zhuoyuan Li, Yubo Ai, Jiahao Lu, ChuXin Wang, Jiacheng Deng, Hanzhi Chang, Yanzhe Liang, Wenfei Yang, Shifeng Zhang, and Tianzhu Zhang. Pamba: Enhancing global inter- action in point clouds via state space model. InAAAI, 2025. 5

work page 2025
[21]

A convolutional decoder for point clouds using adaptive instance normaliza- tion

Isaak Lim, Moritz Ibing, and Leif Kobbelt. A convolutional decoder for point clouds using adaptive instance normaliza- tion. InComputer graphics forum, 2019. 1, 2

work page 2019
[22]

3d-moe: A mixture-of-experts multi-modal llm for 3d vi- sion and pose diffusion via rectified flow.arXiv preprint arXiv:2501.16698, 2025

Yueen Ma, Yuzheng Zhuang, Jianye Hao, and Irwin King. 3d-moe: A mixture-of-experts multi-modal llm for 3d vi- sion and pose diffusion via rectified flow.arXiv preprint arXiv:2501.16698, 2025. 2

work page arXiv 2025
[23]

Spatiallm: Train- ing large language models for structured indoor modeling

Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spatiallm: Train- ing large language models for structured indoor modeling. In NeurIPS, 2025. 1, 5, 6, 7

work page 2025
[24]

Multimodal contrastive learn- ing with limoe: the language-image mixture of experts

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal contrastive learn- ing with limoe: the language-image mixture of experts. NeurIPS, 2022. 2

work page 2022
[25]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 5

work page internal anchor Pith review Pith/arXiv arXiv 2018
[26]

Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

work page 2023
[27]

Masked autoencoders for point cloud self-supervised learning

Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. InECCV, 2022. 2

work page 2022
[28]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. InCVPR, 2023. 2

work page 2023
[29]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InCVPR, 2017. 2 9

work page 2017
[30]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space.NeurIPS, 2017

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.NeurIPS, 2017. 2

work page 2017
[31]

Pointnext: Revisiting pointnet++ with improved training and scaling strategies.NeurIPS, 2022

Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies.NeurIPS, 2022. 2

work page 2022
[32]

Pointdan: A multi-scale 3d domain adaption net- work for point cloud representation.NeurIPS, 2019

Can Qin, Haoxuan You, Lichen Wang, C-C Jay Kuo, and Yun Fu. Pointdan: A multi-scale 3d domain adaption net- work for point cloud representation.NeurIPS, 2019. 1, 2

work page 2019
[33]

An end-to-end robust point cloud semantic segmentation network with single-step conditional diffusion models

Wentao Qu, Jing Wang, YongShun Gong, Xiaoshui Huang, and Liang Xiao. An end-to-end robust point cloud semantic segmentation network with single-step conditional diffusion models. InCVPR, 2025. 5

work page 2025
[34]

Deepspeed-moe: Advanc- ing mixture-of-experts inference and training to power next- generation ai scale

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advanc- ing mixture-of-experts inference and training to power next- generation ai scale. InICML, pages 18332–18346, 2022. 2

work page 2022
[35]

Scaling vision with sparse mix- ture of experts.NeurIPS, 2021

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, Andr ´e Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mix- ture of experts.NeurIPS, 2021. 2

work page 2021
[36]

& Jégou, H

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Herv ´e J ´egou. Spreading vectors for similarity search. arXiv preprint arXiv:1806.03198, 2018. 4

work page arXiv 2018
[37]

Laion-5b: An open large-scale dataset for training next generation image-text models.NeurIPS, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.NeurIPS, 2022. 1

work page 2022
[38]

Objects365: A large-scale, high-quality dataset for object detection

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019. 1

work page 2019
[39]

Pv-rcnn: Point- voxel feature set abstraction for 3d object detection

Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point- voxel feature set abstraction for 3d object detection. In CVPR, 2020. 2

work page 2020
[40]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InCVPR,

work page
[41]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.NeurIPS, 2017

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.NeurIPS, 2017. 3

work page 2017
[42]

Qwen2 Technical Report

Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Kpconv: Flexible and deformable convolution for point clouds

Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Franc ¸ois Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. InICCV, 2019. 2

work page 2019
[44]

Hy- bridtm: Combining transformer and mamba for 3d semantic segmentation

Xinyu Wang, Jinghua Hou, Zhe Liu, and Yingying Zhu. Hy- bridtm: Combining transformer and mamba for 3d semantic segmentation. InIROS, 2025. 5

work page 2025
[45]

Point transformer v2: Grouped vector atten- tion and partition-based pooling.NeurIPS, 2022

Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling.NeurIPS, 2022. 2

work page 2022
[46]

Masked scene contrast: A scalable framework for unsuper- vised 3d representation learning

Xiaoyang Wu, Xin Wen, Xihui Liu, and Hengshuang Zhao. Masked scene contrast: A scalable framework for unsuper- vised 3d representation learning. InCVPR, 2023. 5

work page 2023
[47]

Point transformer v3: Simpler faster stronger

Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. In CVPR, 2024. 2, 5, 6, 8

work page 2024
[48]

Towards large- scale 3d representation learning with multi-dataset point prompt training

Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, and Hengshuang Zhao. Towards large- scale 3d representation learning with multi-dataset point prompt training. InCVPR, 2024. 1, 5, 6, 8

work page 2024
[49]

Sonata: Self- supervised learning of reliable point representations

Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard New- combe, Hengshuang Zhao, and Julian Straub. Sonata: Self- supervised learning of reliable point representations. In CVPR, 2025. 2, 3, 5, 6, 8

work page 2025
[50]

Vit-comer: Vision transformer with convolu- tional multi-scale feature interaction for dense predictions

Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, and Yifeng Shi. Vit-comer: Vision transformer with convolu- tional multi-scale feature interaction for dense predictions. InCVPR, 2024. 1

work page 2024
[51]

Pointcontrast: Unsupervised pre- training for 3d point cloud understanding

Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre- training for 3d point cloud understanding. InECCV, 2020. 2, 5

work page 2020
[52]

Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. InCVPR, 2023. 2

work page 2023
[53]

Habitat-matterport 3d semantics dataset

Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakr- ishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, et al. Habitat-matterport 3d semantics dataset. InCVPR,

work page
[54]

Geometry-guided do- main generalization for monocular 3d object detection

Fan Yang, Hui Chen, Yuwei He, Sicheng Zhao, Chenghao Zhang, Kai Ni, and Guiguang Ding. Geometry-guided do- main generalization for monocular 3d object detection. In AAAI, 2024. 1, 2

work page 2024
[55]

3d-grand: A million-scale dataset for 3d-llms with better grounding and less hallucination

Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F Fouhey, and Joyce Chai. 3d-grand: A million-scale dataset for 3d-llms with better grounding and less hallucination. InCVPR, 2025. 1

work page 2025
[56]

Deepla-net: Very deep lo- cal aggregation networks for point cloud analysis

Ziyin Zeng, Mingyue Dong, Jian Zhou, Huan Qiu, Zhen Dong, Man Luo, and Bijun Li. Deepla-net: Very deep lo- cal aggregation networks for point cloud analysis. InCVPR,

work page
[57]

Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training.NeurIPS, 2022

Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training.NeurIPS, 2022. 2

work page 2022
[58]

Uni3d-moe: Scalable multimodal 3d scene understanding via mixture of experts.arXiv preprint arXiv:2505.21079, 2025

Yue Zhang, Yingzhao Jian, Hehe Fan, Yi Yang, and Roger Zimmermann. Uni3d-moe: Scalable multimodal 3d scene understanding via mixture of experts.arXiv preprint arXiv:2505.21079, 2025. 2 10

work page arXiv 2025
[59]

Con- certo: Joint 2d-3d self-supervised learning emerges spatial representations

Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, and Hengshuang Zhao. Con- certo: Joint 2d-3d self-supervised learning emerges spatial representations. InNeurIPS, 2025. 8

work page 2025
[60]

Point transformer

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InICCV, 2021. 1

work page 2021
[61]

Structured3d: A large photo-realistic dataset for structured 3d modeling

Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InECCV, 2020. 5

work page 2020
[62]

Uni3d: Exploring unified 3d representation at scale.arXiv preprint arXiv:2310.06773,

Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale.arXiv preprint arXiv:2310.06773,

work page arXiv
[63]

V oxelnet: End-to-end learning for point cloud based 3d object detection

Yin Zhou and Oncel Tuzel. V oxelnet: End-to-end learning for point cloud based 3d object detection. InCVPR, 2018. 2 11

work page 2018