Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

· 2026 · cs.CV · arXiv 2602.23069

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
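The abstract says optimal transport is used to quantify the 3D/4D distributional discrepancy, but it does not give the formulation. As a rough illustration only, the sketch below computes an entropy-regularized (Sinkhorn) transport cost between two small feature sets. The function name `sinkhorn_cost`, the parameters `eps` and `n_iters`, the uniform point weights, and the squared Euclidean ground cost are all assumptions made for this example and are not taken from PointATA.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_cost(xs, ys, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between two uniformly weighted point sets.

    Hypothetical helper for illustration; not the paper's discrepancy measure.
    """
    n, m = len(xs), len(ys)
    # Pairwise ground cost and the corresponding Gibbs kernel.
    cost = [[sq_dist(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    # Uniform marginals (an assumption of this sketch).
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    # Sinkhorn iterations: alternately rescale rows and columns
    # so the plan matches both marginals.
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan and its total cost under the ground metric.
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))


# Toy "features": a set compared with itself and with a shifted copy.
source = [(0.0, 0.0), (1.0, 0.0)]
shifted = [(0.0, 1.0), (1.0, 1.0)]
same_cost = sinkhorn_cost(source, source)
shifted_cost = sinkhorn_cost(source, shifted)
```

In this toy setup the cost for the shifted copy is larger than the cost of comparing the set with itself, which is the sense in which a transport cost can serve as a discrepancy score. Very small `eps` values or large distances can underflow the kernel in this plain implementation; a log-domain variant would be the usual remedy.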

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Diffusion Masked Pretraining for Dynamic Point Cloud

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

DiMP uses diffusion to infer clean masked positions from visible context and to model full distributions of point displacements rather than means, delivering 11.21% and 13.65% absolute gains on offline and online action segmentation.

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

Mantis is the first Mamba-native PEFT framework for 3D point cloud models, using state-aware adapters and dual-serialization distillation to match performance with only 5% trainable parameters.

CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

citing papers explorer

Showing 3 of 3 citing papers.

Diffusion Masked Pretraining for Dynamic Point Cloud cs.CV · 2026-05-05 · unverdicted · none · ref 13 · 2 links · internal anchor
Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models cs.CV · 2026-05-05 · unverdicted · none · ref 34 · 2 links · internal anchor
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning cs.AI · 2026-04-13 · unverdicted · none · ref 24 · internal anchor
