
arxiv: 2604.07997 · v1 · submitted 2026-04-09 · 💻 cs.CV

Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

Pith reviewed 2026-05-10 16:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot learning, incremental learning, 3D object detection, vision-language models, indoor environments, multimodal fusion, prototype imprinting, unknown object mining

The pith

Few-shot incremental 3D detection works by mining unknown objects with vision-language models and fusing 2D-3D prototypes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FI3Det, a framework that lets 3D detectors pick up new object types in dynamic indoor spaces after seeing only a handful of examples. It relies on vision-language models to mine previously unseen objects and to extract both 2D semantic features and class-agnostic 3D bounding boxes, then reduces noise in those signals by re-weighting point- and box-level features according to spatial location and feature consistency. In the incremental stage, class prototypes are imprinted from aligned 2D and 3D features, and a gated fusion step combines the two sets of classification scores to decide on detections. Real-world environments change and labeling every new object type is expensive, so efficient adaptation is valuable for robots and smart spaces. The paper reports consistent gains over baseline methods on ScanNet V2 and SUN RGB-D in both batch and sequential learning schedules.

Core claim

The central discovery is that a combination of VLM-guided unknown object mining, spatial and consistency-based feature weighting, and gated multimodal prototype imprinting allows effective few-shot incremental 3D object detection without requiring extensive annotations for novel classes, as demonstrated by consistent improvements on ScanNet V2 and SUN RGB-D datasets in batch and sequential settings.

What carries the argument

The gated multimodal prototype imprinting module constructs category prototypes from aligned 2D semantic and 3D geometric features and fuses their classification scores using a multimodal gating mechanism to detect novel objects.
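The imprint-then-fuse idea described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the cosine-similarity scoring, and the single sigmoid gate (`gate_logit`) are all assumptions, since the page does not specify how prototypes are normalized or how the gate is parameterized.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def imprint_prototypes(features, labels, num_classes):
    """Assumed imprinting rule: mean of normalized support features per class."""
    feats = l2_normalize(features)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(axis=0)
    return l2_normalize(protos)

def cosine_scores(query, protos):
    """Classification scores as cosine similarity to each prototype."""
    return l2_normalize(query) @ protos.T

def gated_fusion(scores_2d, scores_3d, gate_logit):
    """Assumed gate: a sigmoid weight that blends 2D and 3D scores."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return gate * scores_2d + (1.0 - gate) * scores_3d

# Toy few-shot support set: two classes, two shots each (hypothetical data).
labels = np.array([0, 0, 1, 1])
sem_2d = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
geo_3d = np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0],
                   [0.0, 0.0, 1.0], [0.0, 0.2, 0.8]])

protos_2d = imprint_prototypes(sem_2d, labels, num_classes=2)
protos_3d = imprint_prototypes(geo_3d, labels, num_classes=2)

# One query object whose 2D and 3D features both resemble class 0.
query_2d = np.array([[1.0, 0.05]])
query_3d = np.array([[0.9, 0.1, 0.0]])

fused = gated_fusion(cosine_scores(query_2d, protos_2d),
                     cosine_scores(query_3d, protos_3d),
                     gate_logit=0.0)  # 0.0 gives equal weight to both modalities
predicted_class = int(fused.argmax(axis=1)[0])
```

In this sketch the gate is a fixed scalar; a learned, per-object gate would be a natural variant, but the page gives no detail on which form the paper uses.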

If this is right

  • Detectors can add new classes with minimal additional labeling effort.
  • Perception of unknown objects improves as early as the base training stage.
  • Consistent gains appear in both batch addition of all new classes at once and sequential addition over time.
  • The framework defines standard evaluation protocols for this task on two common 3D indoor datasets.
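The difference between the batch and sequential settings mentioned in this list can be made concrete with a small scheduling sketch. The class names and the step size below are hypothetical placeholders; the actual base/novel splits used on ScanNet V2 and SUN RGB-D are not given on this page.

```python
def batch_schedule(novel_classes):
    """Batch setting: all novel classes arrive in one incremental step."""
    return [list(novel_classes)]

def sequential_schedule(novel_classes, step):
    """Sequential setting: novel classes arrive in groups of `step` over time."""
    return [list(novel_classes[i:i + step])
            for i in range(0, len(novel_classes), step)]

# Hypothetical novel class names, for illustration only.
novel = ["novel_0", "novel_1", "novel_2", "novel_3"]

batch_steps = batch_schedule(novel)
sequential_steps = sequential_schedule(novel, step=2)
```

Under this reading, a detector in the sequential setting must retain both the base classes and every earlier novel group while learning the next one, which is why it is usually the harder of the two schedules.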

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reliance on external vision-language models suggests that further advances in those models would directly boost few-shot 3D detection performance.
  • This technique could be adapted for real-time robotic systems that encounter new objects during operation.
  • Extending the weighting and gating ideas to other sensor modalities like depth-only or LiDAR data might broaden its use.
  • If the noise reduction proves robust, the method could handle even noisier inputs from less capable vision-language models.

Load-bearing premise

Vision-language models are able to mine unknown objects reliably and generate 2D semantic features and class-agnostic 3D boxes that are not too noisy for the subsequent weighting and fusion modules to handle effectively.

What would settle it

An ablation that removes the VLM-guided unknown object learning module and checks whether few-shot performance on novel classes falls back to the level of standard incremental baselines on ScanNet V2 or SUN RGB-D.

Figures

Figures reproduced from arXiv: 2604.07997 by Jianjun Qian, Jian Yang, Jin Xie, Na Zhao, Yun Zhu.

Figure 1: Comparison between incremental 3D object detection …
Figure 2: Correlation between base and novel category objects. In …
Figure 3: Overview of our few-shot incremental 3D object detection model. The model consists of two parts: base training and incremental learning. In the base stage, we introduce a VLM-guided unknown object learning module that uses 2D VLMs to generate unknown objects, thereby improving the perception of unknown objects. In the incremental stage, we propose a gated multimodal prototype imprinting module that builds …
Figure 4: Visualization comparison of features. In (b), the …
Figure 5: Qualitative comparison on the ScanNet V2 …
Figure 6: Ablation of different components in UOM and UOW.
Figure 8: Statistical analysis of the number of instances for each …
Figure 9: Qualitative comparison on the ScanNet V2 …
Figure 10: Qualitative comparison on the SUN RGB-D …
Original abstract

Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.
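The abstract's weighting mechanism, which re-weights features by spatial location and feature consistency within each box, can be illustrated for the point-level case. The sketch below is an assumption-laden illustration rather than the paper's formulation: the Gaussian falloff from the box center, the `sigma` parameter, and the use of clipped cosine similarity to the mean feature direction are all choices made here for clarity.

```python
import numpy as np

def point_weights(points, feats, box_center, box_size, sigma=0.5):
    """Weight each point in a mined box by location and feature agreement.

    Assumed form: weight = spatial term * consistency term.
    """
    # Spatial term: offsets are scaled so the box faces sit at +/-1,
    # then a Gaussian favors points near the box center.
    offsets = (points - box_center) / (0.5 * box_size)
    spatial = np.exp(-np.sum(offsets ** 2, axis=1) / (2.0 * sigma ** 2))

    # Consistency term: cosine similarity between each point feature and
    # the mean feature direction of the box, with negatives clipped to 0.
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    mean_dir = unit.mean(axis=0)
    mean_dir = mean_dir / (np.linalg.norm(mean_dir) + 1e-8)
    consistency = np.clip(unit @ mean_dir, 0.0, 1.0)

    return spatial * consistency

# Toy box with three points (hypothetical data):
#   0: at the center, feature agrees with the majority
#   1: near a face, feature agrees with the majority
#   2: at the center, feature opposes the majority (likely noise)
points = np.array([[0.0, 0.0, 0.0], [0.9, 0.0, 0.0], [0.0, 0.0, 0.0]])
feats = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
weights = point_weights(points, feats,
                        box_center=np.array([0.0, 0.0, 0.0]),
                        box_size=np.array([2.0, 2.0, 2.0]))
```

With these toy inputs the central, consistent point receives the largest weight, the point near the box face is down-weighted, and the inconsistent point is suppressed entirely. A box-level weight could be derived by aggregating the point weights, but how the paper does this is not stated here.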

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes FI3Det, the first framework for few-shot incremental 3D object detection in dynamic indoor environments. It uses a VLM-guided unknown object learning module in the base stage to mine unknown objects and extract 2D semantic features plus class-agnostic 3D bounding boxes. Noise in these representations is addressed by a weighting mechanism that re-weights point- and box-level features according to spatial location and feature consistency. A gated multimodal prototype imprinting module then builds category prototypes from aligned 2D/3D features and fuses classification scores via multimodal gating. New batch and sequential evaluation protocols are introduced on ScanNet V2 and SUN RGB-D, with reported strong and consistent gains over baselines; code is released.

Significance. If the results hold, the work would be significant as the first dedicated approach to few-shot incremental 3D detection, lowering annotation costs for novel classes in embodied indoor settings. Establishing batch and sequential protocols on standard datasets is a useful contribution to evaluation methodology. The code release is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. The central performance claims rest on the assumption that VLMs produce sufficiently accurate class-agnostic 3D boxes and low-noise 2D semantic features from few-shot indoor samples; the weighting and gated-imprinting modules are then asserted to clean residual noise. No quantitative metrics (e.g., mining precision/recall, feature noise statistics, or VLM output quality on ScanNet V2 / SUN RGB-D) are supplied to show that these downstream steps can recover from typical VLM domain-shift errors, leaving the reported gains unsubstantiated.
  2. The abstract states that FI3Det 'achieves strong and consistent improvements over baseline methods' in both batch and sequential settings, yet supplies no numerical results, ablation tables, or error analysis. Without these, it is impossible to verify whether gains are attributable to the proposed weighting and gating components or to other factors.
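The mining precision and recall requested in the first comment could be measured along the following lines. This is a minimal sketch under stated assumptions: boxes are axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, matching is a simple greedy one-to-one assignment, and the IoU threshold of 0.25 is an illustrative default rather than a value taken from the paper.

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def mining_precision_recall(mined, gt, iou_thr=0.25):
    """Greedy one-to-one matching of mined boxes to ground-truth boxes."""
    matched = set()
    true_pos = 0
    for m in mined:
        best_j, best_iou = -1, iou_thr
        for j, g in enumerate(gt):
            if j in matched:
                continue
            iou = iou_3d(m, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(mined) if len(mined) else 0.0
    recall = true_pos / len(gt) if len(gt) else 0.0
    return precision, recall

# Hypothetical boxes: one mined box hits a ground-truth object, one misses.
mined_boxes = np.array([[0, 0, 0, 1, 1, 1], [5, 5, 5, 6, 6, 6]], dtype=float)
gt_boxes = np.array([[0, 0, 0, 1, 1, 1],
                     [2, 2, 2, 3, 3, 3],
                     [8, 8, 8, 9, 9, 9]], dtype=float)

precision, recall = mining_precision_recall(mined_boxes, gt_boxes)
```

Reporting these two numbers per dataset, before and after the weighting step, would show directly how much VLM noise the downstream modules have to absorb.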

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of FI3Det as the first dedicated framework for few-shot incremental 3D object detection, along with the value of the new evaluation protocols and code release. We address each major comment below.

  1. Referee: The central performance claims rest on the assumption that VLMs produce sufficiently accurate class-agnostic 3D boxes and low-noise 2D semantic features from few-shot indoor samples; the weighting and gated-imprinting modules are then asserted to clean residual noise. No quantitative metrics (e.g., mining precision/recall, feature noise statistics, or VLM output quality on ScanNet V2 / SUN RGB-D) are supplied to show that these downstream steps can recover from typical VLM domain-shift errors, leaving the reported gains unsubstantiated.

    Authors: We agree that explicit quantitative validation of the VLM mining step would further substantiate the claims. In the revised manuscript we will add precision/recall metrics for unknown-object mining on both ScanNet V2 and SUN RGB-D, together with before/after statistics on feature consistency and noise levels. These additions will directly illustrate how the spatial- and consistency-based weighting recovers from typical VLM domain-shift errors. The existing ablation studies already isolate the contribution of the weighting and gated-imprinting modules by showing performance degradation when either component is removed.

    Revision: yes

  2. Referee: The abstract states that FI3Det 'achieves strong and consistent improvements over baseline methods' in both batch and sequential settings, yet supplies no numerical results, ablation tables, or error analysis. Without these, it is impossible to verify whether gains are attributable to the proposed weighting and gating components or to other factors.

    Authors: Abstracts are subject to strict length limits and therefore omit detailed tables and analyses. The full manuscript already contains the requested material: quantitative results for both batch and sequential protocols (Tables 1–2), component ablations (Table 3), and error analysis (Section 4.3). These tables and figures explicitly attribute the observed gains to the weighting and gating modules. To address the concern, we will revise the abstract to include a small number of concrete improvement figures (e.g., mAP deltas) while remaining within the word limit.

    Revision: partial

Circularity Check

0 steps flagged

No circularity; framework and evaluations are self-contained


The paper introduces FI3Det as a novel framework using VLM-guided unknown object mining, spatial/feature weighting, and gated multimodal prototype imprinting for few-shot incremental 3D detection. No equations, derivations, or fitted parameters are shown that reduce performance claims to quantities defined by the method's own inputs. New batch/sequential evaluation protocols on ScanNet V2 and SUN RGB-D are presented as external benchmarks without circular dependence on internal definitions or self-citations. Central claims rest on empirical gains over baselines rather than tautological reductions or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.


