PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding

Chaoqun Zheng; Dingkang Liang; Siyuan Liu; Tianrui Feng; Xiang Bai; Xin Zhou

arxiv: 2604.04933 · v1 · submitted 2026-04-06 · 💻 cs.CV

Siyuan Liu, Chaoqun Zheng, Xin Zhou, Tianrui Feng, Dingkang Liang, Xiang Bai

Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D scene understanding · point cloud segmentation · test-time adaptation · dynamic parameters · parameter-efficient fine-tuning · semantic segmentation · PTv3 backbone

The pith

PointTPA generates input-aware parameters for local patches in 3D point clouds, raising ScanNet mIoU to 78.4 percent with under 2 percent added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a network can handle varied 3D scenes better by generating fresh weights on the fly for each local patch instead of keeping one fixed set of parameters throughout inference. Standard backbones struggle because real scenes differ in geometry, category balance, and layout, yet most methods freeze their weights after training. PointTPA adds two small modules that first group points into coherent patches and then project input-dependent weights for those patches, letting the model adjust its behavior without retraining the whole network. A sympathetic reader would care because the added modules stay small (under 2 percent of the backbone's parameters) while making the model more responsive to the unpredictable structure of real scenes.

Core claim

PointTPA is a test-time parameter adaptation framework. It uses Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches from input point clouds and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights. Integrated into the PTv3 backbone, these two lightweight modules together account for less than 2 percent of the original parameters, let the network adjust its behavior to scene-specific variations, and reach 78.4 percent mIoU on ScanNet validation while outperforming prior parameter-efficient fine-tuning approaches on multiple benchmarks.
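To make the grouping step concrete, here is a minimal sketch of one way to form patches by serialization. It is not the authors' implementation: the page does not specify the curve, the grid resolution, or the patch size, so the z-order (Morton) curve and the names `morton_code`, `group_by_serialization`, `grid_size`, and `patch_size` are all assumptions chosen for illustration.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative grid indices (z-order).

    Assumes each index fits in `bits` bits (0..1023 for the default).
    """
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def group_by_serialization(points, grid_size=0.5, patch_size=4):
    """Sort points along a z-order curve, then cut the order into patches.

    `points` is a list of (x, y, z) tuples. Returns a list of patches,
    each a list of indices into `points`. The last patch may be shorter.
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    keys = []
    for idx, p in enumerate(points):
        # Quantize each coordinate to a grid cell (hypothetical resolution).
        cell = [int((p[d] - mins[d]) / grid_size) for d in range(3)]
        keys.append((morton_code(*cell), idx))
    # Points that share or neighbor a cell end up close in the sorted order.
    order = [idx for _, idx in sorted(keys)]
    return [order[i:i + patch_size] for i in range(0, len(order), patch_size)]
```

With two well-separated clusters of four points each and `patch_size=4`, the sketch returns one patch per cluster regardless of the input order, which is the "locally coherent" property the summary describes.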

What carries the argument

The Dynamic Parameter Projector takes patch features from Serialization-based Neighborhood Grouping and outputs custom network weights for each patch, so the backbone can change its computation according to the current scene.
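The idea of patch-wise weight generation can be sketched in a few lines. This is an illustrative assumption, not the paper's DPP design, which this page does not describe: here a patch is mean-pooled into a descriptor, a fixed `projector` matrix maps that descriptor to a weight offset, and the offset is added to a shared `base_weight`. All function and parameter names are hypothetical.

```python
def mean_pool(patch_feats):
    """Average the feature vectors of one patch into a single descriptor."""
    n = len(patch_feats)
    dim = len(patch_feats[0])
    return [sum(f[c] for f in patch_feats) / n for c in range(dim)]


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def project_patch_weight(patch_feats, projector, base_weight):
    """Build a patch-specific weight: shared base plus a projected offset.

    `projector` has dim*dim rows and dim columns, so its output can be
    reshaped into a dim x dim offset (an assumed, simple parameterization).
    """
    dim = len(base_weight)
    delta = matvec(projector, mean_pool(patch_feats))
    return [[base_weight[r][c] + delta[r * dim + c] for c in range(dim)]
            for r in range(dim)]


def apply_dynamic_layer(patches, projector, base_weight):
    """Apply a different generated weight matrix to each patch's features."""
    out = []
    for patch_feats in patches:
        w = project_patch_weight(patch_feats, projector, base_weight)
        out.append([matvec(w, f) for f in patch_feats])
    return out
```

If `projector` is all zeros, every patch sees the same `base_weight` and the layer behaves like a static one, which gives a convenient baseline for checking how much the input-dependent part contributes.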

If this is right

The backbone maintains strong performance on ScanNet validation while the added modules stay below 2 percent of its parameter count.
The same modules surpass existing parameter-efficient fine-tuning methods across several 3D scene benchmarks.
The network adjusts its internal behavior to each scene's geometry and layout during inference without any additional training pass.
Local patch grouping followed by per-patch weight generation keeps the adaptation both spatially coherent and computationally light.
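The "below 2 percent" figure is a simple ratio of added to backbone parameters. A minimal sketch of that arithmetic follows; the parameter counts used in the usage note are made-up round numbers, not values from the paper, and the helper names are hypothetical.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Parameter count of one fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


def overhead_ratio(backbone_params, added_params):
    """Fraction of extra parameters relative to the frozen backbone."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return added_params / backbone_params
```

For example, with a hypothetical backbone of 50,000,000 parameters and 900,000 added parameters, `overhead_ratio(50_000_000, 900_000)` is 0.018, that is 1.8 percent, which would satisfy the stated budget.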

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same patch-wise adaptation idea could be tested on outdoor LiDAR data where scene layouts change even more abruptly than in indoor scans.
If the projector proves stable, future models might replace heavy pre-training on mixed datasets with lightweight on-the-fly adjustment for each new environment.
The approach hints that conditional weight generation may be more efficient than adding more layers or channels when the goal is robustness to scene diversity.

Load-bearing premise

The patch-wise parameters produced by the Dynamic Parameter Projector will improve results on diverse scenes without introducing instability or requiring scene-specific tuning that was not disclosed.

What would settle it

Running PointTPA on a new collection of indoor scenes with deliberately varied layouts and measuring whether mIoU falls below the static PTv3 baseline or fluctuates sharply when the projector is replaced by random weights of the same size.
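The comparison described above needs a per-scene mIoU and a way to flag scenes where the adapted model falls below the baseline. The following is a minimal sketch under simplifying assumptions: labels are flat lists of integer class ids, classes absent from both prediction and ground truth are skipped, and the helper names are hypothetical. Benchmark mIoU is usually accumulated over the whole validation set rather than averaged per scene, so the numbers from this sketch are for relative comparison only.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither list
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def count_regressions(baseline_scores, adapted_scores):
    """Return indices of scenes where the adapted model scores lower."""
    return [i for i, (b, a) in enumerate(zip(baseline_scores, adapted_scores))
            if a < b]
```

A non-empty result from `count_regressions` on the new scene collection would be the kind of evidence this test is looking for.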

Figures

Figures reproduced from arXiv: 2604.04933 by Chaoqun Zheng, Dingkang Liang, Siyuan Liu, Tianrui Feng, Xiang Bai, Xin Zhou.

**Figure 1.** (a) Scene-level point clouds have more points and highly imbalanced category distributions compared to object-level point clouds.

**Figure 2.** Overview of PointTPA. It consists of a Serialization-based Neighborhood Grouping (SNG) and a Dynamic Parameter Projector …

**Figure 3.** Illustration of our mixed-insertion strategy. PointTPA …

**Figure 4.** Comparison of PEFT methods and FFT on ScanNet [ …

**Figure 5.** Qualitative comparison against IDPT [58]. Green and red boxes indicate correct and incorrect segmentations, respectively, with GT denoting the ground truth.

**Figure 6.** Visualization of the similarity of dynamic weights.

**Figure 1.** A comparison of FFT, IDPT [58], DAPT [69], and PointTPA on segmentation performance, evaluated on ScanNet [8].

**Figure 2.** More visualizations of the semantic segmentation results of our PointTPA on four large-scale scene datasets. (a) ScanNet [ …

Original abstract

Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced category distributions, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static network parameters during inference, limiting their adaptability to dynamic scene data. We propose PointTPA, a Test-time Parameter Adaptation framework that generates input-aware network parameters for scene-level point clouds. PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while maintaining a low parameter overhead. Integrated into the PTv3 structure, PointTPA demonstrates strong parameter efficiency by introducing two lightweight modules of less than 2% of the backbone's parameters. Despite this minimal parameter overhead, PointTPA achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning (PEFT) methods across multiple benchmarks, highlighting the efficacy of our test-time dynamic network parameter adaptation mechanism in enhancing 3D scene understanding. The code is available at https://github.com/H-EmbodVis/PointTPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PointTPA adds lightweight test-time adaptation via patch grouping and a dynamic projector to PTv3, hitting 78.4% mIoU with under 2% extra parameters, but the isolated value of the dynamic part still needs clearer evidence.

The core contribution is a test-time adaptation setup for scene-level point cloud segmentation. Serialization-based Neighborhood Grouping turns the input into locally coherent patches, and the Dynamic Parameter Projector then produces patch-specific weights for the backbone on the fly. Integrated with PTv3, the two modules add less than 2% parameters yet reach 78.4% mIoU on ScanNet validation and beat prior PEFT baselines on several benchmarks. Code release is a plus for anyone wanting to try it.

Referee Report

3 major / 2 minor

Summary. The paper proposes PointTPA, a test-time parameter adaptation framework for scene-level 3D point cloud understanding. It introduces Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches from point clouds and a Dynamic Parameter Projector (DPP) to generate input-aware patch-wise weights that adapt the PTv3 backbone to scene-specific geometry and layout variations. The two modules add less than 2% parameters to the backbone; the method reports 78.4% mIoU on ScanNet validation and outperforms existing parameter-efficient fine-tuning (PEFT) approaches across multiple benchmarks.

Significance. If the dynamic adaptation mechanism proves robust, the work would offer a practical route to parameter-efficient handling of diverse 3D scenes without retraining or large overhead, addressing a real limitation of static networks in scene understanding. The low parameter count and code release are positive for reproducibility and deployment.

major comments (3)

[Experimental results] The headline 78.4% mIoU on ScanNet validation is presented without an ablation that isolates the contribution of the Dynamic Parameter Projector (DPP) from the Serialization-based Neighborhood Grouping (SNG) alone; this is load-bearing for the central claim that input-aware dynamic weights drive the improvement.
[Experimental results] No per-scene or per-category variance statistics or stability analysis is reported for the patch-wise parameters produced by DPP, leaving the claim of reliable adaptation across diverse geometries and layouts unverified.
[Method] The manuscript provides no statement on whether the DPP projection weights or any adaptation step requires scene-dependent hyperparameter choices; if such tuning is present but undisclosed, the parameter-efficiency argument is weakened.
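The variance statistics requested in the second comment could be reported with very little machinery. As a hedged sketch, assuming per-scene mIoU scores are already available as a list of floats, a summary like the following would cover spread and worst case; the function name and the choice of population standard deviation are assumptions, not something the paper or the report prescribes.

```python
import statistics


def summarize_per_scene(scores):
    """Summarize per-scene scores: mean, population std, and extremes."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),  # population standard deviation
        "min": min(scores),
        "max": max(scores),
    }
```

Reporting the same summary for the static baseline and for the adapted model would let a reader see whether the gain comes with a wider or narrower spread across scenes.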

minor comments (2)

[Method] The abstract and method sections use the term 'test-time' but the precise inference-time procedure (e.g., whether DPP runs once per scene or per patch) should be clarified with a diagram or pseudocode.
[Experiments] Table captions and baseline descriptions should explicitly state whether all compared PEFT methods were trained with identical optimizer, schedule, and data augmentation settings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, agreeing where revisions are needed to strengthen the presentation of our results and claims.

Referee: [Experimental results] The headline 78.4% mIoU on ScanNet validation is presented without an ablation that isolates the contribution of the Dynamic Parameter Projector (DPP) from the Serialization-based Neighborhood Grouping (SNG) alone; this is load-bearing for the central claim that input-aware dynamic weights drive the improvement.

Authors: We agree that an explicit ablation isolating the DPP from SNG would more directly support the central claim regarding the benefit of input-aware dynamic weights. Our current experiments demonstrate gains of the full PointTPA (SNG + DPP) over the PTv3 baseline and PEFT methods, but do not include this specific isolation. We will add the requested ablation in the revised manuscript, evaluating SNG with static parameters versus the full dynamic adaptation.

revision: yes

Referee: [Experimental results] No per-scene or per-category variance statistics or stability analysis is reported for the patch-wise parameters produced by DPP, leaving the claim of reliable adaptation across diverse geometries and layouts unverified.

Authors: We acknowledge that variance and stability statistics would provide stronger verification of reliable adaptation. While overall benchmark improvements suggest robustness, such per-scene analysis was not included in the original submission. In the revision, we will incorporate per-scene and per-category variance statistics for the DPP-generated parameters along with basic stability metrics.

revision: yes

Referee: [Method] The manuscript provides no statement on whether the DPP projection weights or any adaptation step requires scene-dependent hyperparameter choices; if such tuning is present but undisclosed, the parameter-efficiency argument is weakened.

Authors: We confirm that the DPP projection weights and all adaptation steps use fixed hyperparameters with no scene-dependent choices or tuning. These values were selected once on a validation set and held constant across all scenes. We will add an explicit clarifying statement in the method section of the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture proposal with no load-bearing derivations

The paper proposes PointTPA as a test-time adaptation framework consisting of Serialization-based Neighborhood Grouping (SNG) and Dynamic Parameter Projector (DPP) modules inserted into PTv3. All central claims are framed as empirical outcomes: the modules add <2% parameters and yield 78.4% mIoU on ScanNet validation while outperforming PEFT baselines. No equations, uniqueness theorems, fitted-parameter predictions, or self-citation chains are invoked that would reduce the reported gains to the inputs by construction. The derivation chain is therefore self-contained as an engineering contribution whose validity rests on external benchmark results rather than internal redefinition.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the empirical effectiveness of two newly introduced modules whose internal design choices (patch size, projector dimensions, serialization order) are not detailed in the abstract and are therefore treated as free parameters or domain assumptions.

free parameters (2)

SNG patch size and serialization parameters
Chosen to produce locally coherent patches; exact values not given in abstract but required for the grouping step.
DPP output dimension and projection weights
Determines the adaptive weights per patch; kept under 2% of backbone but still a design choice fitted for reported performance.

axioms (1)

domain assumption: PTv3 backbone layers can accept and benefit from externally supplied patch-wise parameters without retraining the core weights.
Invoked when the paper states integration into PTv3 while keeping overhead low.

invented entities (2)

Serialization-based Neighborhood Grouping (SNG) · no independent evidence
purpose: To form locally coherent patches from unordered point clouds
New module introduced by the paper; no independent evidence outside this work.
Dynamic Parameter Projector (DPP) · no independent evidence
purpose: To generate input-aware network parameters for each patch
New module introduced by the paper; no independent evidence outside this work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "achieves 78.4% mIoU on ScanNet validation... two lightweight modules of less than 2% of the backbone's parameters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

[1] Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 9902–9912,
[2] Paul Albert, Frederic Z Zhang, Cristian Rodriguez-Opazo, Hemanth Saratchandran, Anton van den Hengel, and Ehsan Abbasnejad. Randlora: full rank parameter-efficient fine-tuning of large models. 2024.
[3] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1534–1543, 2016.
[4] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 7020–7030, 2023.
[5] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 21674–21683, 2023.
[6] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3075–3084, 2019.
[7] Pointcept Contributors. Pointcept: A codebase for point cloud perception research. https://github.com/Pointcept/Pointcept, 2023.
[8] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5828–5839, 2017.
[9] Lue Fan, Yuxue Yang, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Super sparse 3d object detection. IEEE Trans. Pattern Anal. Mach. Intell., 45(10):12490–12505,
[10] Ben Fei, Liwen Liu, Weidong Yang, Zhijun Li, Wen-Ming Chen, and Lipeng Ma. Parameter efficient point cloud prompt tuning for unified point cloud understanding. IEEE Trans. Intell. Vehicles, 10(1):255–271, 2025.
[11] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In Proc. IEEE Int. Conf. Comput. Vis., pages 24823–24834, 2025.
[12] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning. arXiv preprint arXiv:2512.13636, 2025.
[13] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational visual media, 7(2):187–199,
[14] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 43(12):4338–4364, 2020.
[15] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 15587–15597, 2021.
[16] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In Proc. Int. Conf. Mach. Learn., pages 2790–2799, 2019.
[17] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In Proc. Int. Conf. Learn. Representations, 2022.
[18] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Proc. Eur. Conf. Comput. Vis., pages 709–727, 2022.
[19] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4867–4876, 2020.
[20] Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. Vera: Vector-based random matrix adaptation. In Proc. Int. Conf. Learn. Representations, 2024.
[21] Jingyu Li, Zhe Liu, Jinghua Hou, and Dingkang Liang. Dds3d: Dense pseudo-labels with dynamic threshold for semi-supervised 3d object detection. In Proc. IEEE Int. Conf. Robotics Automation, pages 9245–9252, 2023.
[22] Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, and Li Zhang. Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2026.
[23] Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, and Li Zhang. Imagidrive: A unified imagination-and-planning framework for autonomous driving. In Proc. IEEE Int. Conf. Robotics Automation, 2026.
[24] Jingyu Li, Xiaolong Zhao, Zhe Liu, Wenxiao Wu, and Li Zhang. Geoteacher: Geometry-guided semi-supervised 3d object detection. In Proc. IEEE Int. Conf. Robotics Automation, 2026.
[25] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proc. Annual Meeting of the Association for Computational Linguistics, pages 4582–4597, 2021.
[26] Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. In Proc. Adv. Neural Inf. Process. Syst., pages 32653–32677, 2024.
[27] Dingkang Liang, Tianrui Feng, Xin Zhou, Yumeng Zhang, Zhikang Zou, and Xiang Bai. Parameter-efficient fine-tuning in spectral domain for point cloud learning. IEEE Trans. Pattern Anal. Mach. Intell., 47(12):10949–10966, 2025.
[28] Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, and Xiang Bai. Sood++: Leveraging unlabeled data to boost oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell., 48(1):840–858, 2025.
[29] Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, and Xiang Bai. Cook and clean together: Teaching embodied agents for parallel task execution. In Proc. AAAI Conf. Artif. Intell., pages 18415–18424, 2026.
[30] Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, and Xiang Bai. Unifuture: A 4d driving world model for future generation and perception. In Proc. IEEE Int. Conf. Robotics Automation, 2026.
[31] Ze Liu, Han Hu, Yue Cao, Zheng Zhang, and Xin Tong. A closer look at local aggregation operators in point cloud analysis. In Proc. Eur. Conf. Comput. Vis., pages 326–342, 2020.
[32] Dening Lu, Qian Xie, Mingqiang Wei, Kyle Gao, Linlin Xu, and Jonathan Li. Transformers in 3d point clouds: A survey. arXiv preprint arXiv:2205.07417, 2022.
[33] Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi Feng, Xiaodan Liang, Hang Xu, and Chunjing Xu. Voxel transformer for 3d object detection. In Proc. IEEE Int. Conf. Comput. Vis., pages 3164–3173, 2021.
[34] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In Proc. Eur. Conf. Comput. Vis., pages 604–621, 2022.
[35] Jinyoung Park, Sanghyeok Lee, Sihyeon Kim, Yunyang Xiong, and Hyunwoo J Kim. Self-positioning point-based transformer for point cloud understanding. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 21814–21823,
[36] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 652–660, 2017.
[37] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. Adv. Neural Inf. Process. Syst., pages 5105–5114, 2017.
[38] Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. In Proc. Adv. Neural Inf. Process. Syst., pages 23192–23204, 2022.
[39] David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3d semantic segmentation in the wild. In Proc. Eur. Conf. Comput. Vis., pages 125–141, 2022.
[40] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. IEEE Int. Conf. Comput. Vis., pages 945–953, 2015.
[41]

Parameter-efficient prompt learning for 3d point cloud understanding

Hongyu Sun, Yongcai Wang, Wang Chen, Haoran Deng, and Deying Li. Parameter-efficient prompt learning for 3d point cloud understanding. InProc. IEEE Int. Conf. Robotics Au- tomation, pages 9478–9486, 2024. 3

work page 2024
[42]

Point- peft: Parameter-efficient fine-tuning for 3d pre-trained mod- els

Yiwen Tang, Ray Zhang, Zoey Guo, Xianzheng Ma, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. Point- peft: Parameter-efficient fine-tuning for 3d pre-trained mod- els. InProc. AAAI Conf. Artif. Intell., pages 5171–5179,

work page
[43]

Any2point: Empowering any-modality large models for efficient 3d understanding

Yiwen Tang, Ray Zhang, Jiaming Liu, Zoey Guo, Bin Zhao, Zhigang Wang, Peng Gao, Hongsheng Li, Dong Wang, and Xuelong Li. Any2point: Empowering any-modality large models for efficient 3d understanding. InProc. Eur. Conf. Comput. Vis., pages 456–473, 2024. 3

work page 2024
[44]

Dylora: Parameter-efficient tuning of pre- trained models using dynamic search-free low-rank adapta- tion

Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. Dylora: Parameter-efficient tuning of pre- trained models using dynamic search-free low-rank adapta- tion. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287, 2023. 3

work page 2023
[45]

Dynamic graph cnn for learning on point clouds.ACM Trans

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds.ACM Trans. ON Graphics, 38(5):1–12, 2019. 3

work page 2019
[46]

Point transformer v2: Grouped vector atten- tion and partition-based pooling

Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling. InProc. Adv. Neural Inf. Process. Syst., pages 33330–33342, 2022. 2

work page 2022
[47]

Masked scene contrast: A scalable framework for unsuper- vised 3d representation learning

Xiaoyang Wu, Xin Wen, Xihui Liu, and Hengshuang Zhao. Masked scene contrast: A scalable framework for unsuper- vised 3d representation learning. InProc. IEEE Conf. Com- put. Vis. Pattern Recognit., pages 9415–9424, 2023. 1, 3

work page 2023
[48]

Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4840–4851, 2024. 2, 3, 5, 6
[49]

Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, and Hengshuang Zhao. Towards large-scale 3d representation learning with multi-dataset point prompt training. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 19551–19562, 2024. 6
[50]

Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, and Julian Straub. Sonata: Self-supervised learning of reliable point representations. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 22193–22204, 2025. 1, 3, 5, 6, 12
[51]

Tiange Xiang, Chaoyi Zhang, Yang Song, Jianhui Yu, and Weidong Cai. Walk in the cloud: Learning curves for point clouds shape analysis. In Proc. IEEE Int. Conf. Comput. Vis., pages 915–924, 2021. 2
[52]

Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In Proc. Eur. Conf. Comput. Vis., pages 574–591, 2020. 1, 3
[53]

Wei Xu, Chunsheng Shi, Sifan Tu, Xin Zhou, Dingkang Liang, and Xiang Bai. A unified framework for 3d scene understanding. In Proc. Adv. Neural Inf. Process. Syst., pages 59468–59490, 2024. 2
[54]

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proc. IEEE Int. Conf. Comput. Vis., pages 12–22, 2023. 2, 5, 6, 13
[55]

Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 19313–19322,
[56]

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proc. Annual Meeting of the Association for Computational Linguistics, pages 1–9, 2022. 5, 6, 12
[57]

Yaohua Zha, Rongsheng Li, Tao Dai, Jianyu Xiong, Xin Wang, and Shu-Tao Xia. Sfr: Semantic-aware feature rendering of point cloud. In Proc. Int. Conf. Acoustics, Speech, Signal Process., pages 1–5, 2023. 1
[58]

Yaohua Zha, Jinpeng Wang, Tao Dai, Bin Chen, Zhi Wang, and Shu-Tao Xia. Instance-aware dynamic prompt tuning for pre-trained point cloud models. In Proc. IEEE Int. Conf. Comput. Vis., pages 14161–14170, 2023. 1, 3, 5, 6, 8, 12, 13
[59]

Yaohua Zha, Huizhen Ji, Jinmin Li, Rongsheng Li, Tao Dai, Bin Chen, Zhi Wang, and Shu-Tao Xia. Towards compact 3d representations via point feature enhancement masked autoencoders. In Proc. AAAI Conf. Artif. Intell., pages 6962–6970, 2024. 2
[60]

Yaohua Zha, Naiqi Li, Yanzi Wang, Tao Dai, Hang Guo, Bin Chen, Zhi Wang, Zhihao Ouyang, and Shu-Tao Xia. Lcm: Locally constrained compact point cloud model for masked point modeling. pages 104816–104842, 2024. 2
[61]

Yaohua Zha, Yanzi Wang, Tao Dai, and Shu-Tao Xia. Pre-training point cloud compact model with partial-aware reconstruction. arXiv preprint arXiv:2407.09344, 2024. 3
[62]

Yaohua Zha, Tao Dai, Hang Guo, Yanzi Wang, Bin Chen, Ke Chen, and Shu-Tao Xia. Point cloud mixture-of-domain-experts model for 3d self-supervised learning. In Proc. Int. Joint Conf. Artif. Intell., pages 2332–2340, 2025. 3
[63]

Yaohua Zha, Yanzi Wang, Hang Guo, Jinpeng Wang, Tao Dai, Bin Chen, Zhihao Ouyang, Xue Yuerong, Ke Chen, and Shu-Tao Xia. Pma: Towards parameter-efficient point cloud understanding via point mamba adapter. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 16976–16986, 2025. 3
[64]

Dingyuan Zhang, Dingkang Liang, Zhikang Zou, Jingyu Li, Xiaoqing Ye, Zhe Liu, Xiao Tan, and Xiang Bai. A simple vision transformer for weakly semi-supervised 3d object detection. In Proc. IEEE Int. Conf. Comput. Vis., pages 8373–8383, 2023. 1
[65]

Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training. In Proc. Adv. Neural Inf. Process. Syst., pages 27061–27074, 2022. 2, 3
[66]

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 8552–8562, 2022. 3
[67]

Renrui Zhang, Liuhui Wang, Yali Wang, Peng Gao, Hongsheng Li, and Jianbo Shi. Starting from non-parametric networks for 3d point cloud analysis. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5344–5353, 2023. 2
[68]

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proc. IEEE Int. Conf. Comput. Vis., pages 16259–16268, 2021. 1, 3
[69]

Xin Zhou, Dingkang Liang, Wei Xu, Xingkui Zhu, Yihan Xu, Zhikang Zou, and Xiang Bai. Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 14707–14717, 2024. 1, 3, 5, 6, 12, 13
[70]

Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. In Proc. IEEE Int. Conf. Comput. Vis., pages 27817–27827,
[71]

Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4490–4499, 2018. 3
[72]

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In Proc. IEEE Int. Conf. Comput. Vis., pages 2639–2650, 2023. 3

Supplementary Material

S1. Additional Experiments

S1.1. Analysis on Different Rank

One of our core hyperparame...
