pith. sign in

arxiv: 2604.04933 · v1 · submitted 2026-04-06 · 💻 cs.CV

PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding

Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene understandingpoint cloud segmentationtest-time adaptationdynamic parametersparameter-efficient fine-tuningsemantic segmentationPTv3 backbone
0
0 comments X

The pith

PointTPA generates input-aware parameters for local patches in 3D point clouds, raising ScanNet mIoU to 78.4 percent with under 2 percent added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a network can improve its handling of varied 3D scenes by creating fresh weights on the fly for each local patch instead of keeping one fixed set of parameters throughout inference. Standard backbones struggle because real scenes differ in geometry, object balance, and layout, yet most methods lock the weights after training. PointTPA adds two small modules that first group points into coherent patches and then project scene-specific weights for those patches, letting the model adjust its behavior without retraining the whole network. A sympathetic reader would care because this keeps the model small and fast while making it more responsive to the unpredictable structure of real environments such as rooms or streets.

Core claim

PointTPA is a test-time parameter adaptation framework that uses Serialization-based Neighborhood Grouping to form locally coherent patches from input point clouds and a Dynamic Parameter Projector to produce patch-wise adaptive weights; when integrated into the PTv3 backbone these two lightweight modules, together less than 2 percent of the original parameters, enable the network to adjust its behavior to scene-specific variations and reach 78.4 percent mIoU on ScanNet validation while outperforming prior parameter-efficient fine-tuning approaches on multiple benchmarks.

What carries the argument

The Dynamic Parameter Projector, which takes patch features from Serialization-based Neighborhood Grouping and outputs custom network weights for each patch so the backbone can change its computation according to the current scene.

If this is right

  • The backbone maintains strong performance on ScanNet validation while the added modules stay below 2 percent of its parameter count.
  • The same modules surpass existing parameter-efficient fine-tuning methods across several 3D scene benchmarks.
  • The network adjusts its internal behavior to each scene's geometry and layout during inference without any additional training pass.
  • Local patch grouping followed by per-patch weight generation keeps the adaptation both spatially coherent and computationally light.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same patch-wise adaptation idea could be tested on outdoor LiDAR data where scene layouts change even more abruptly than in indoor scans.
  • If the projector proves stable, future models might replace heavy pre-training on mixed datasets with lightweight on-the-fly adjustment for each new environment.
  • The approach hints that conditional weight generation may be more efficient than adding more layers or channels when the goal is robustness to scene diversity.

Load-bearing premise

The patch-wise parameters produced by the Dynamic Parameter Projector will improve results on diverse scenes without introducing instability or requiring scene-specific tuning that was not disclosed.

What would settle it

Running PointTPA on a new collection of indoor scenes with deliberately varied layouts and measuring whether mIoU falls below the static PTv3 baseline or fluctuates sharply when the projector is replaced by random weights of the same size.

Figures

Figures reproduced from arXiv: 2604.04933 by Chaoqun Zheng, Dingkang Liang, Siyuan Liu, Tianrui Feng, Xiang Bai, Xin Zhou.

Figure 1
Figure 1. Figure 1: (a) Scene-level point clouds have more points and highly imbalanced category distributions compared to object-level point clouds. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of PointTPA. It consists of a Serialization-based Neighborhood Grouping (SNG) and a Dynamic Parameter Projector [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of our mixed-insertion strategy. PointTPA [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of PEFT methods and FFT on ScanNet [ [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of the similarity of dynamic weights. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison against IDPT [58]. Green and red boxes indicate correct and incorrect segmentations, respec￾tively, with GT denoting the ground truth. timal balance between representational capacity and train￾ing stability, we evaluate various DPP insertion strategies (Tab. 5). Notably, a dense DPP configuration introduces redundant parameters and degrades performance, reducing mIoU by 0.6% and allA… view at source ↗
Figure 1
Figure 1. Figure 1: A comparison of FFT, IDPT [58], DAPT [69], and PointTPA on segmentation performance, evaluated on ScanNet [8]. (a) (b) (c) (d) [PITH_FULL_IMAGE:figures/full_fig_p013_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: More visualizations of the semantic segmentation results of our PointTPA on four large-scale scene datasets. (a) ScanNet [ [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
read the original abstract

Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced category distributions, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static network parameters during inference, limiting their adaptability to dynamic scene data. We propose PointTPA, a Test-time Parameter Adaptation framework that generates input-aware network parameters for scene-level point clouds. PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while maintaining a low parameter overhead. Integrated into the PTv3 structure, PointTPA demonstrates strong parameter efficiency by introducing two lightweight modules of less than 2% of the backbone's parameters. Despite this minimal parameter overhead, PointTPA achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning (PEFT) methods across multiple benchmarks, highlighting the efficacy of our test-time dynamic network parameter adaptation mechanism in enhancing 3D scene understanding. The code is available at https://github.com/H-EmbodVis/PointTPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PointTPA, a test-time parameter adaptation framework for scene-level 3D point cloud understanding. It introduces Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches from point clouds and a Dynamic Parameter Projector (DPP) to generate input-aware patch-wise weights that adapt the PTv3 backbone to scene-specific geometry and layout variations. The two modules add less than 2% parameters to the backbone; the method reports 78.4% mIoU on ScanNet validation and outperforms existing parameter-efficient fine-tuning (PEFT) approaches across multiple benchmarks.

Significance. If the dynamic adaptation mechanism proves robust, the work would offer a practical route to parameter-efficient handling of diverse 3D scenes without retraining or large overhead, addressing a real limitation of static networks in scene understanding. The low parameter count and code release are positive for reproducibility and deployment.

major comments (3)
  1. [Experimental results] Experimental results: the headline 78.4% mIoU on ScanNet validation is presented without an ablation that isolates the contribution of the Dynamic Parameter Projector (DPP) from the Serialization-based Neighborhood Grouping (SNG) alone; this is load-bearing for the central claim that input-aware dynamic weights drive the improvement.
  2. [Experimental results] Experimental results: no per-scene or per-category variance statistics or stability analysis is reported for the patch-wise parameters produced by DPP, leaving the claim of reliable adaptation across diverse geometries and layouts unverified.
  3. [Method] Method description: the manuscript provides no statement on whether the DPP projection weights or any adaptation step requires scene-dependent hyperparameter choices; if such tuning is present but undisclosed, the parameter-efficiency argument is weakened.
minor comments (2)
  1. [Method] The abstract and method sections use the term 'test-time' but the precise inference-time procedure (e.g., whether DPP runs once per scene or per patch) should be clarified with a diagram or pseudocode.
  2. [Experiments] Table captions and baseline descriptions should explicitly state whether all compared PEFT methods were trained with identical optimizer, schedule, and data augmentation settings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, agreeing where revisions are needed to strengthen the presentation of our results and claims.

read point-by-point responses
  1. Referee: [Experimental results] Experimental results: the headline 78.4% mIoU on ScanNet validation is presented without an ablation that isolates the contribution of the Dynamic Parameter Projector (DPP) from the Serialization-based Neighborhood Grouping (SNG) alone; this is load-bearing for the central claim that input-aware dynamic weights drive the improvement.

    Authors: We agree that an explicit ablation isolating the DPP from SNG would more directly support the central claim regarding the benefit of input-aware dynamic weights. Our current experiments demonstrate gains of the full PointTPA (SNG + DPP) over the PTv3 baseline and PEFT methods, but do not include this specific isolation. We will add the requested ablation in the revised manuscript, evaluating SNG with static parameters versus the full dynamic adaptation. revision: yes

  2. Referee: [Experimental results] Experimental results: no per-scene or per-category variance statistics or stability analysis is reported for the patch-wise parameters produced by DPP, leaving the claim of reliable adaptation across diverse geometries and layouts unverified.

    Authors: We acknowledge that variance and stability statistics would provide stronger verification of reliable adaptation. While overall benchmark improvements suggest robustness, such per-scene analysis was not included in the original submission. In the revision, we will incorporate per-scene and per-category variance statistics for the DPP-generated parameters along with basic stability metrics. revision: yes

  3. Referee: [Method] Method description: the manuscript provides no statement on whether the DPP projection weights or any adaptation step requires scene-dependent hyperparameter choices; if such tuning is present but undisclosed, the parameter-efficiency argument is weakened.

    Authors: We confirm that the DPP projection weights and all adaptation steps use fixed hyperparameters with no scene-dependent choices or tuning. These values were selected once on a validation set and held constant across all scenes. We will add an explicit clarifying statement in the method section of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture proposal with no load-bearing derivations

full rationale

The paper proposes PointTPA as a test-time adaptation framework consisting of Serialization-based Neighborhood Grouping (SNG) and Dynamic Parameter Projector (DPP) modules inserted into PTv3. All central claims are framed as empirical outcomes: the modules add <2% parameters and yield 78.4% mIoU on ScanNet validation while outperforming PEFT baselines. No equations, uniqueness theorems, fitted-parameter predictions, or self-citation chains are invoked that would reduce the reported gains to the inputs by construction. The derivation chain is therefore self-contained as an engineering contribution whose validity rests on external benchmark results rather than internal redefinition.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of two newly introduced modules whose internal design choices (patch size, projector dimensions, serialization order) are not detailed in the abstract and are therefore treated as free parameters or domain assumptions.

free parameters (2)
  • SNG patch size and serialization parameters
    Chosen to produce locally coherent patches; exact values not given in abstract but required for the grouping step.
  • DPP output dimension and projection weights
    Determines the adaptive weights per patch; kept under 2% of backbone but still a design choice fitted for reported performance.
axioms (1)
  • domain assumption PTv3 backbone layers can accept and benefit from externally supplied patch-wise parameters without retraining the core weights.
    Invoked when the paper states integration into PTv3 while keeping overhead low.
invented entities (2)
  • Serialization-based Neighborhood Grouping (SNG) no independent evidence
    purpose: To form locally coherent patches from unordered point clouds
    New module introduced by the paper; no independent evidence outside this work.
  • Dynamic Parameter Projector (DPP) no independent evidence
    purpose: To generate input-aware network parameters for each patch
    New module introduced by the paper; no independent evidence outside this work.

pith-pipeline@v0.9.0 · 5512 in / 1493 out tokens · 40017 ms · 2026-05-10T18:40:11.450336+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

  1. [1]

    Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding

    Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Ro- drigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 9902–9912,

  2. [2]

    Randlora: full rank parameter-efficient fine- tuning of large models

    Paul Albert, Frederic Z Zhang, Cristian Rodriguez-Opazo, Hemanth Saratchandran, Anton van den Hengel, and Ehsan Abbasnejad. Randlora: full rank parameter-efficient fine- tuning of large models. 2024. 6, 12

  3. [3]

    3d semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1534–1543, 2016. 5, 6, 13

  4. [4]

    Clip2scene: Towards label-efficient 3d scene under- standing by clip

    Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene under- standing by clip. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 7020–7030, 2023. 3

  5. [5]

    V oxelnext: Fully sparse voxelnet for 3d object detection and tracking

    Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. V oxelnext: Fully sparse voxelnet for 3d object detection and tracking. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 21674–21683, 2023. 3

  6. [6]

    4d spatio-temporal convnets: Minkowski convolutional neural networks

    Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. InProc. IEEE Conf. Comput. Vis. Pattern Recog- nit., pages 3075–3084, 2019. 6

  7. [7]

    Pointcept: A codebase for point cloud perception research.https://github.com/ Pointcept/Pointcept, 2023

    Pointcept Contributors. Pointcept: A codebase for point cloud perception research.https://github.com/ Pointcept/Pointcept, 2023. 5

  8. [8]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5828–5839, 2017. 2, 5, 6, 8, 12, 13

  9. [9]

    Super sparse 3d object detection.IEEE Trans

    Lue Fan, Yuxue Yang, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Super sparse 3d object detection.IEEE Trans. Pattern Anal. Mach. Intell., 45(10):12490–12505,

  10. [10]

    Parameter efficient point cloud prompt tuning for unified point cloud understanding.IEEE Trans

    Ben Fei, Liwen Liu, Weidong Yang, Zhijun Li, Wen-Ming Chen, and Lipeng Ma. Parameter efficient point cloud prompt tuning for unified point cloud understanding.IEEE Trans. Intell. Vehicles, 10(1):255–271, 2025. 3

  11. [11]

    Orion: A holistic end-to- end autonomous driving framework by vision-language in- structed action generation

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to- end autonomous driving framework by vision-language in- structed action generation. InProc. IEEE Int. Conf. Comput. Vis., pages 24823–24834, 2025. 2

  12. [12]

    Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv Preprint arXiv:2512.13636, 2025

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning. arXiv preprint arXiv:2512.13636, 2025. 2

  13. [13]

    Pct: Point cloud transformer.Computational visual media, 7(2):187–199,

    Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer.Computational visual media, 7(2):187–199,

  14. [14]

    Deep learning for 3d point clouds: A survey.IEEE Trans

    Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey.IEEE Trans. Pattern Anal. Mach. Intell., 43(12):4338–4364, 2020. 1

  15. [15]

    Exploring data-efficient 3d scene understanding with contrastive scene contexts

    Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 15587–15597, 2021. 3

  16. [16]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InProc. Int. Conf. Mach. Learn., pages 2790–2799, 2019. 5, 6

  17. [17]

    Lora: Low- rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low- rank adaptation of large language models. InProc. Int. Conf. Learn. Representations, 2022. 3, 5, 6, 12

  18. [18]

    Vi- sual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi- sual prompt tuning. InProc. Eur. Conf. Comput. Vis., pages 709–727, 2022. 3

  19. [19]

    Pointgroup: Dual-set point grouping for 3d instance segmentation

    Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi- Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4867–4876, 2020. 1

  20. [20]

    Vera: Vector-based random matrix adaptation

    Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. Vera: Vector-based random matrix adaptation. In Proc. Int. Conf. Learn. Representations, 2024. 5, 6, 12

  21. [21]

    Dds3d: Dense pseudo-labels with dynamic threshold for semi-supervised 3d object detection

    Jingyu Li, Zhe Liu, Jinghua Hou, and Dingkang Liang. Dds3d: Dense pseudo-labels with dynamic threshold for semi-supervised 3d object detection. InProc. IEEE Int. Conf. Robotics Automation, pages 9245–9252, 2023. 2

  22. [22]

    Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving

    Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, and Li Zhang. Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2026. 1

  23. [23]

    Imagidrive: A unified imagination-and- planning framework for autonomous driving

    Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, and Li Zhang. Imagidrive: A unified imagination-and- planning framework for autonomous driving. InProc. IEEE Int. Conf. Robotics Automation, 2026. 2

  24. [24]

    Geoteacher: Geometry-guided semi-supervised 3d object detection

    Jingyu Li, Xiaolong Zhao, Zhe Liu, Wenxiao Wu, and Li Zhang. Geoteacher: Geometry-guided semi-supervised 3d object detection. InProc. IEEE Int. Conf. Robotics Automa- tion, 2026. 2

  25. [25]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InProc. Annual Meet- ing of the Association for Computational Linguistics, pages 4582–4597, 2021. 3, 5, 6, 12

  26. [26]

    Pointmamba: A simple state space model for point cloud analysis

    Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. InProc. Adv. Neural Inf. Process. Syst., pages 32653–32677, 2024. 2 9

  27. [27]

    Parameter-efficient fine-tuning in spectral domain for point cloud learning.IEEE Trans

    Dingkang Liang, Tianrui Feng, Xin Zhou, Yumeng Zhang, Zhikang Zou, and Xiang Bai. Parameter-efficient fine-tuning in spectral domain for point cloud learning.IEEE Trans. Pattern Anal. Mach. Intell., 47(12):10949–10966, 2025. 1, 2, 3, 5, 6, 12

  28. [28]

    Sood++: Leveraging unlabeled data to boost oriented object detection.IEEE Trans

    Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, and Xiang Bai. Sood++: Leveraging unlabeled data to boost oriented object detection.IEEE Trans. Pattern Anal. Mach. Intell., 48(1):840–858, 2025. 2

  29. [29]

    Cook and clean together: Teaching embodied agents for parallel task execution

    Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, and Xiang Bai. Cook and clean together: Teaching embodied agents for parallel task execution. In Proc. AAAI Conf. Artif. Intell., pages 18415–18424, 2026. 1

  30. [30]

    Unifuture: A 4d driving world model for future generation and perception

    Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, and Xiang Bai. Unifuture: A 4d driving world model for future generation and perception. InProc. IEEE Int. Conf. Robotics Automation, 2026. 1

  31. [31]

    A closer look at local aggregation operators in point cloud anal- ysis

    Ze Liu, Han Hu, Yue Cao, Zheng Zhang, and Xin Tong. A closer look at local aggregation operators in point cloud anal- ysis. InProc. Eur. Conf. Comput. Vis., pages 326–342, 2020. 2

  32. [32]

    Transformers in 3d point clouds: A survey.arXiv preprint arXiv:2205.07417, 2022

    Dening Lu, Qian Xie, Mingqiang Wei, Kyle Gao, Linlin Xu, and Jonathan Li. Transformers in 3d point clouds: A survey. arXiv preprint arXiv:2205.07417, 2022. 1

  33. [33]

    V oxel transformer for 3d object detection

    Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi Feng, Xiaodan Liang, Hang Xu, and Chunjing Xu. V oxel transformer for 3d object detection. InProc. IEEE Int. Conf. Comput. Vis., pages 3164–3173, 2021. 3

  34. [34]

    Masked autoencoders for point cloud self-supervised learning

    Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. InProc. Eur. Conf. Comput. Vis., pages 604–621, 2022. 3

  35. [35]

    Self-positioning point-based transformer for point cloud understanding

    Jinyoung Park, Sanghyeok Lee, Sihyeon Kim, Yunyang Xiong, and Hyunwoo J Kim. Self-positioning point-based transformer for point cloud understanding. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 21814–21823,

  36. [36]

    Pointnet: Deep learning on point sets for 3d classification and segmentation

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 652–660, 2017. 2

  37. [37]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. InProc. Adv. Neural Inf. Pro- cess. Syst., pages 5105–5114, 2017. 2

  38. [38]

    Pointnext: Revisiting pointnet++ with improved training and scaling strategies

    Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. InProc. Adv. Neural Inf. Pro- cess. Syst., pages 23192–23204, 2022. 2, 6

  39. [39]

    Language- grounded indoor 3d semantic segmentation in the wild

    David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In Proc. Eur. Conf. Comput. Vis., pages 125–141, 2022. 12, 13

  40. [40]

    Multi-view convolutional neural networks for 3d shape recognition

    Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. InProc. IEEE Int. Conf. Comput. Vis., pages 945–953, 2015. 3

  41. [41]

    Parameter-efficient prompt learning for 3d point cloud understanding

    Hongyu Sun, Yongcai Wang, Wang Chen, Haoran Deng, and Deying Li. Parameter-efficient prompt learning for 3d point cloud understanding. InProc. IEEE Int. Conf. Robotics Au- tomation, pages 9478–9486, 2024. 3

  42. [42]

    Point- peft: Parameter-efficient fine-tuning for 3d pre-trained mod- els

    Yiwen Tang, Ray Zhang, Zoey Guo, Xianzheng Ma, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. Point- peft: Parameter-efficient fine-tuning for 3d pre-trained mod- els. InProc. AAAI Conf. Artif. Intell., pages 5171–5179,

  43. [43]

    Any2point: Empowering any-modality large models for efficient 3d understanding

    Yiwen Tang, Ray Zhang, Jiaming Liu, Zoey Guo, Bin Zhao, Zhigang Wang, Peng Gao, Hongsheng Li, Dong Wang, and Xuelong Li. Any2point: Empowering any-modality large models for efficient 3d understanding. InProc. Eur. Conf. Comput. Vis., pages 456–473, 2024. 3

  44. [44]

    Dylora: Parameter-efficient tuning of pre- trained models using dynamic search-free low-rank adapta- tion

    Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. Dylora: Parameter-efficient tuning of pre- trained models using dynamic search-free low-rank adapta- tion. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287, 2023. 3

  45. [45]

    Dynamic graph cnn for learning on point clouds.ACM Trans

    Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds.ACM Trans. ON Graphics, 38(5):1–12, 2019. 3

  46. [46]

    Point transformer v2: Grouped vector atten- tion and partition-based pooling

    Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling. InProc. Adv. Neural Inf. Process. Syst., pages 33330–33342, 2022. 2

  47. [47]

    Masked scene contrast: A scalable framework for unsuper- vised 3d representation learning

    Xiaoyang Wu, Xin Wen, Xihui Liu, and Hengshuang Zhao. Masked scene contrast: A scalable framework for unsuper- vised 3d representation learning. InProc. IEEE Conf. Com- put. Vis. Pattern Recognit., pages 9415–9424, 2023. 1, 3

  48. [48]

    Point transformer v3: Simpler faster stronger

    Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4840– 4851, 2024. 2, 3, 5, 6

  49. [49]

    Towards large- scale 3d representation learning with multi-dataset point prompt training

    Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, and Hengshuang Zhao. Towards large- scale 3d representation learning with multi-dataset point prompt training. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 19551–19562, 2024. 6

  50. [50]

    Sonata: Self- supervised learning of reliable point representations

    Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard New- combe, Hengshuang Zhao, and Julian Straub. Sonata: Self- supervised learning of reliable point representations. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 22193– 22204, 2025. 1, 3, 5, 6, 12

  51. [51]

    Walk in the cloud: Learning curves for point clouds shape analysis

    Tiange Xiang, Chaoyi Zhang, Yang Song, Jianhui Yu, and Weidong Cai. Walk in the cloud: Learning curves for point clouds shape analysis. InProc. IEEE Int. Conf. Comput. Vis., pages 915–924, 2021. 2

  52. [52]

    Pointcontrast: Unsupervised pre- training for 3d point cloud understanding

    Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre- training for 3d point cloud understanding. InProc. Eur. Conf. Comput. Vis., pages 574–591, 2020. 1, 3 10

  53. [53]

    A unified framework for 3d scene un- derstanding

    Wei Xu, Chunsheng Shi, Sifan Tu, Xin Zhou, Dingkang Liang, and Xiang Bai. A unified framework for 3d scene un- derstanding. InProc. Adv. Neural Inf. Process. Syst., pages 59468–59490, 2024. 2

  54. [54]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProc. IEEE Int. Conf. Comput. Vis., pages 12–22, 2023. 2, 5, 6, 13

  55. [55]

    Point-bert: Pre-training 3d point cloud transformers with masked point modeling

    Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 19313–19322,

  56. [56]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. InProc. Annual Meeting of the Association for Computational Linguistics, pages 1–9, 2022. 5, 6, 12

  57. [57]

    Sfr: Semantic-aware feature ren- dering of point cloud

    Yaohua Zha, Rongsheng Li, Tao Dai, Jianyu Xiong, Xin Wang, and Shu-Tao Xia. Sfr: Semantic-aware feature ren- dering of point cloud. InProc. Int. Conf. Acoustics, Speech, Signal Process., pages 1–5, 2023. 1

  58. [58]

    Instance-aware dynamic prompt tuning for pre-trained point cloud models

    Yaohua Zha, Jinpeng Wang, Tao Dai, Bin Chen, Zhi Wang, and Shu-Tao Xia. Instance-aware dynamic prompt tuning for pre-trained point cloud models. InProc. IEEE Int. Conf. Comput. Vis., pages 14161–14170, 2023. 1, 3, 5, 6, 8, 12, 13

  59. [59]

    Towards compact 3d representations via point feature enhancement masked au- toencoders

    Yaohua Zha, Huizhen Ji, Jinmin Li, Rongsheng Li, Tao Dai, Bin Chen, Zhi Wang, and Shu-Tao Xia. Towards compact 3d representations via point feature enhancement masked au- toencoders. InProc. AAAI Conf. Artif. Intell., pages 6962– 6970, 2024. 2

  60. [60]

    Lcm: Locally constrained compact point cloud model for masked point modeling

    Yaohua Zha, Naiqi Li, Yanzi Wang, Tao Dai, Hang Guo, Bin Chen, Zhi Wang, Zhihao Ouyang, and Shu-Tao Xia. Lcm: Locally constrained compact point cloud model for masked point modeling. pages 104816–104842, 2024. 2

  61. [61]

    Pre- training point cloud compact model with partial-aware re- construction.arXiv preprint arXiv:2407.09344, 2024

    Yaohua Zha, Yanzi Wang, Tao Dai, and Shu-Tao Xia. Pre- training point cloud compact model with partial-aware re- construction.arXiv preprint arXiv:2407.09344, 2024. 3

  62. [62]

    Point cloud mixture-of-domain- experts model for 3d self-supervised learning

    Yaohua Zha, Tao Dai, Hang Guo, Yanzi Wang, Bin Chen, Ke Chen, and Shu-Tao Xia. Point cloud mixture-of-domain- experts model for 3d self-supervised learning. InProc. Int. Joint Conf. Artif. Intell., pages 2332–2340, 2025. 3

  63. [63]

    Pma: Towards parameter-efficient point cloud understanding via point mamba adapter

    Yaohua Zha, Yanzi Wang, Hang Guo, Jinpeng Wang, Tao Dai, Bin Chen, Zhihao Ouyang, Xue Yuerong, Ke Chen, and Shu-Tao Xia. Pma: Towards parameter-efficient point cloud understanding via point mamba adapter. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 16976–16986, 2025. 3

  64. [64]

    A simple vision transformer for weakly semi-supervised 3d object de- tection

    Dingyuan Zhang, Dingkang Liang, Zhikang Zou, Jingyu Li, Xiaoqing Ye, Zhe Liu, Xiao Tan, and Xiang Bai. A simple vision transformer for weakly semi-supervised 3d object de- tection. InProc. IEEE Int. Conf. Comput. Vis., pages 8373– 8383, 2023. 1

  65. [65]

    Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training

    Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training. InProc. Adv. Neural Inf. Process. Syst., pages 27061–27074, 2022. 2, 3

  66. [66]

    Pointclip: Point cloud understanding by clip

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xu- peng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 8552– 8562, 2022. 3

  67. [67]

    Starting from non-parametric net- works for 3d point cloud analysis

    Renrui Zhang, Liuhui Wang, Yali Wang, Peng Gao, Hong- sheng Li, and Jianbo Shi. Starting from non-parametric net- works for 3d point cloud analysis. InProc. IEEE Conf. Com- put. Vis. Pattern Recognit., pages 5344–5353, 2023. 2

  68. [68]

    Point transformer

    Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProc. IEEE Int. Conf. Comput. Vis., pages 16259–16268, 2021. 1, 3

  69. [69]

    Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis

    Xin Zhou, Dingkang Liang, Wei Xu, Xingkui Zhu, Yihan Xu, Zhikang Zou, and Xiang Bai. Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 14707–14717, 2024. 1, 3, 5, 6, 12, 13

  70. [70]

    Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation

    Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. In Proc. IEEE Int. Conf. Comput. Vis., pages 27817–27827,

  71. [71]

    V oxelnet: End-to-end learn- ing for point cloud based 3d object detection

    Yin Zhou and Oncel Tuzel. V oxelnet: End-to-end learn- ing for point cloud based 3d object detection. InProc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4490– 4499, 2018. 3

  72. [72]

    Point- clip v2: Prompting clip and gpt for powerful 3d open-world learning

    Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Point- clip v2: Prompting clip and gpt for powerful 3d open-world learning. InProc. IEEE Int. Conf. Comput. Vis., pages 2639– 2650, 2023. 3 11 Supplementary Material S1. Additional Experiments S1.1. Analysis on Different Rank One of our core hyperparame...