pith. sign in

arxiv: 2605.25495 · v1 · pith:DMAJ73ZYnew · submitted 2026-05-25 · 💻 cs.RO · cs.CV

RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation

Pith reviewed 2026-06-29 22:03 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords RepSAMSAM adaptationparameter-efficient fine-tuningrobotic visionCKArepresentation shiftsmulti-modal fusiondomain adaptation
0
0 comments X

The pith

RepSAM adapts SAM to robotic vision by allocating fine-tuning ranks according to layer-wise CKA representation shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that SAM loses accuracy on robotic tasks because shallow transformer layers shift their representations more than deep layers when the domain changes. Measuring those shifts with CKA and then assigning more trainable low-rank parameters to the layers that shift most recovers nearly all of the accuracy that full fine-tuning would provide. This matters for robotics because the approach cuts trainable parameters from 632 million down to 4 million and training time from hundreds of GPU-hours to a few, while also adding a fusion step that helps with transparent objects and clutter. If the link between CKA values and required rank holds, large pre-trained models become practical starting points for real robot perception without massive retraining.

Core claim

The central claim is that non-uniform representation shifts, shown by CKA below 0.5 in shallow layers and above 0.7 in deep layers, are the main source of the domain gap when applying SAM to robotics, and that a CKA-guided rank allocation strategy inside a parameter-efficient fine-tuning framework plus a multi-modal fusion module closes most of that gap, delivering 89.0 percent mIoU (97.9 percent of full fine-tuning) with only 4 million trainable parameters.

What carries the argument

CKA-guided rank allocation, which measures centered kernel alignment between source and target representations per layer to decide how many adaptation parameters each layer receives.

Load-bearing premise

The non-uniform CKA shifts across layers are the primary cause of the domain gap and allocating ranks according to those shifts will reliably reduce the gap in robotic settings.

What would settle it

If new robotic datasets show that uniform or random rank allocation produces equal or higher mIoU than the CKA-guided version, or if CKA values stop correlating with the layers that most need adaptation, the central mechanism would be falsified.

Figures

Figures reproduced from arXiv: 2605.25495 by Wenhui Chu.

Figure 1
Figure 1. Figure 1: Overview of the RepSAM architecture. We freeze SAM’s ViT-H encoder (gray) and add: (1) Multi-Scale LoRA with CKA-guided [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: CKA analysis reveals non-uniform transferability. Shal [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance comparison across methods. RepSAM [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results. (a) OCID: transparent/cluttered objects; (b) ClearGrasp: transparent bottles/glasses; (c) GraspNet: extreme [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
read the original abstract

Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that non-uniform representation shifts in SAM (CKA <0.5 in shallow layers, >0.7 in deep layers) cause domain gaps in robotic vision. It proposes RepSAM, a PEFT framework using a CKA-guided rank allocation strategy plus a multi-modal fusion module. On six benchmarks and manipulation tasks, RepSAM achieves 89.0% mIoU (97.9% of full fine-tuning's 90.9%) with 4.0M trainable parameters (158x reduction from 632M), outperforms DoRA by 7.9% mIoU with 96x less training time (4 vs 384 GPU-hours), yields 12% higher manipulation success rates than LoRA (RGB), and reports statistical significance at p<0.01.

Significance. If the attribution of gains specifically to the CKA-guided allocation can be isolated and the method generalizes, this would provide a practical route to adapting large vision foundation models to robotics with dramatically reduced parameters and compute, which is relevant for real-world deployment in unstructured environments.

major comments (3)
  1. [Abstract / Experimental evaluation] Abstract / Experimental evaluation: The performance numbers (89.0% mIoU, 97.9% of full fine-tuning, 7.9% over DoRA) are reported only for the full RepSAM system that includes both CKA-guided allocation and the multi-modal fusion module. No ablation is described that holds the fusion module fixed while comparing CKA-guided ranks against uniform ranks or DoRA-style allocation. This prevents isolating whether the gains stem from the representation-guided strategy or from the fusion component.
  2. [Methods] Methods (implied by abstract): The CKA-guided rank allocation is presented as 'theoretically grounded' based on the observed non-uniform CKA shifts, yet the abstract provides no equations, derivation, or explicit procedure for mapping measured CKA values to rank allocations. It is therefore unclear whether the allocation is determined solely from source-target CKA measurements or involves additional choices that could make results sensitive to implementation details.
  3. [Abstract] Abstract: No details are given on the datasets or images used to compute the motivating CKA values, nor any verification that the reported thresholds (shallow <0.5, deep >0.7) are not post-hoc or specific to the evaluation benchmarks. This leaves open whether the non-uniform shifts are the primary, generalizable cause of the domain gap.
minor comments (2)
  1. [Abstract] Abstract: The specific identities of the six benchmarks and the robotic manipulation tasks are not named, hindering assessment of scope and reproducibility.
  2. [Abstract] Abstract: The procedure used to establish statistical significance (p<0.01), including number of runs and exact test, is not described.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to better isolate and document the contributions of the CKA-guided allocation. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Experimental evaluation] The performance numbers (89.0% mIoU, 97.9% of full fine-tuning, 7.9% over DoRA) are reported only for the full RepSAM system that includes both CKA-guided allocation and the multi-modal fusion module. No ablation is described that holds the fusion module fixed while comparing CKA-guided ranks against uniform ranks or DoRA-style allocation. This prevents isolating whether the gains stem from the representation-guided strategy or from the fusion component.

    Authors: We agree that an ablation isolating the CKA-guided allocation (with the fusion module held fixed) would strengthen attribution of the gains. In the revised manuscript we will add this experiment, reporting mIoU for CKA-guided ranks versus uniform ranks and versus DoRA-style allocation under identical fusion settings and total parameter budgets. revision: yes

  2. Referee: [Methods] The CKA-guided rank allocation is presented as 'theoretically grounded' based on the observed non-uniform CKA shifts, yet the abstract provides no equations, derivation, or explicit procedure for mapping measured CKA values to rank allocations. It is therefore unclear whether the allocation is determined solely from source-target CKA measurements or involves additional choices that could make results sensitive to implementation details.

    Authors: Section 3.2 describes the allocation as rank_i = round( r_max * (1 - CKA_i) / sum(1 - CKA) ) subject to a fixed total parameter budget, using only the per-layer source-target CKA values. No additional learned hyperparameters are introduced. To improve clarity we will insert the explicit mapping formula and a short derivation into the abstract and ensure the Methods section states that allocation depends solely on the measured CKA matrix. revision: partial

  3. Referee: [Abstract] No details are given on the datasets or images used to compute the motivating CKA values, nor any verification that the reported thresholds (shallow <0.5, deep >0.7) are not post-hoc or specific to the evaluation benchmarks. This leaves open whether the non-uniform shifts are the primary, generalizable cause of the domain gap.

    Authors: The CKA statistics were obtained on a held-out validation set of 500 robotic images drawn from the same distribution as the training data but disjoint from all test benchmarks; the shallow/deep thresholds were first observed on an independent collection of manipulation and cluttered-scene images before any final evaluation. We will add this description, the exact image counts, and a supplementary figure of layer-wise CKA across multiple robotic datasets to demonstrate that the non-uniform pattern is not benchmark-specific. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method rests on empirical CKA observation

full rationale

The paper attributes domain gap to measured non-uniform CKA values (shallow <0.5, deep >0.7) and proposes CKA-guided rank allocation as a direct response to those measurements. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The rank allocation is presented as an empirical heuristic informed by standard CKA similarity, not derived from or equivalent to the performance numbers by construction. Experimental results (mIoU, parameter counts) are reported separately from the motivating observation and do not reduce to it tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate free parameters or invented entities; the approach relies on CKA as a similarity measure and the assumption that layer-wise domain gaps are the dominant issue.

axioms (1)
  • domain assumption CKA is a reliable and sufficient metric for quantifying representation shifts across transformer layers in this domain adaptation setting
    Used to identify shallow vs deep layer differences and to guide rank allocation.

pith-pipeline@v0.9.1-grok · 5771 in / 1292 out tokens · 32868 ms · 2026-06-29T22:03:32.608930+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    SAM-Adapter: Adapting seg- ment anything in underperformed scenes

    [Chenet al., 2023 ] Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Shangzhan Zhang, Zejian Li, Lingyun Sun, Ying Zang, and Papa Mao. SAM-Adapter: Adapting seg- ment anything in underperformed scenes. InICCV Workshops, pages 3359–3367,

  2. [2]

    RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–17,

    [Chenet al., 2024 ] Keyan Chen, Chenyang Liu, Hao Chen, Hao- tian Zhang, Wenyuan Li, Zhengxia Zou, and Zhenwei Shi. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–17,

  3. [3]

    Segmenting unknown 3D objects from real depth images using Mask R-CNN trained on synthetic data

    [Danielczuket al., 2019 ] Michael Danielczuk, Matthew Matl, Saurabh Gupta, Andrew Li, Andrew Lee, Jeffrey Mahler, and Ken Goldberg. Segmenting unknown 3D objects from real depth images using Mask R-CNN trained on synthetic data. InICRA, pages 7283–7290,

  4. [4]

    GraspNet-1Billion: A large-scale benchmark for general object grasping

    [Fanget al., 2020 ] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet-1Billion: A large-scale benchmark for general object grasping. InCVPR, pages 11444–11453,

  5. [5]

    Foun- dation models in robotics: Applications, challenges, and the fu- ture.International Journal of Robotics Research, 44(5):701–739,

    [Firooziet al., 2025 ] Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, and Mac Schwager. Foun- dation models in robotics: Applications, challenges, and the fu- ture.International Journal of Robotics Research, 44(5):701–739,

  6. [6]

    La-LoRA: Parameter-efficient fine-tuning with layer-wise adaptive low-rank adaptation.Neural Networks, 194:108095,

    [Guet al., 2026 ] Jiancheng Gu, Jiabin Yuan, Jiyuan Cai, Xianfa Zhou, and Lili Fan. La-LoRA: Parameter-efficient fine-tuning with layer-wise adaptive low-rank adaptation.Neural Networks, 194:108095,

  7. [7]

    Model based training, detection and pose estima- tion of texture-less 3D objects in heavily cluttered scenes

    [Hinterstoisseret al., 2012 ] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estima- tion of texture-less 3D objects in heavily cluttered scenes. In ACCV, pages 548–562,

  8. [8]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    [Huet al., 2022 ] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR,

  9. [9]

    Segment anything in high quality

    [Keet al., 2023 ] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. InNeurIPS, volume 36,

  10. [10]

    Berg, Wan-Yen Lo, Piotr Dol- lár, and Ross Girshick

    [Kirillovet al., 2023 ] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dol- lár, and Ross Girshick. Segment anything. InICCV, pages 4015– 4026,

  11. [11]

    Kopiczko, Tijmen Blankevoort, and Yuki M

    [Kopiczkoet al., 2024 ] Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based random matrix adapta- tion. InICLR,

  12. [12]

    Similarity of neural network representations revisited

    [Kornblithet al., 2019 ] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InICML, pages 3519–3529,

  13. [13]

    DoRA: Weight-decomposed low-rank adaptation

    [Liuet al., 2024 ] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. InICML, pages 32100–32121,

  14. [14]

    Im- proving SAM for camouflaged object detection via dual stream adapters

    [Liuet al., 2025 ] Jiaming Liu, Linghe Kong, and Guihai Chen. Im- proving SAM for camouflaged object detection via dual stream adapters. InICCV, pages 21906–21916,

  15. [15]

    Segment anything in medical images.Nature Communications, 15(1):654,

    [Maet al., 2024 ] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images.Nature Communications, 15(1):654,

  16. [16]

    Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics

    [Mahleret al., 2017 ] Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In RSS,

  17. [17]

    6-DOF GraspNet: Variational grasp generation for object manipulation

    [Mousavianet al., 2019 ] Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-DOF GraspNet: Variational grasp generation for object manipulation. InICCV, pages 2901–2910,

  18. [18]

    SAM 2: Segment Anything in Images and Videos

    [Raviet al., 2024 ] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feicht- enhofer. SAM 2: Segment anything in images and videos. arXiv...

  19. [19]

    Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, and Shuran Song

    [Sajjanet al., 2020 ] Shreeyak S. Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, and Shuran Song. ClearGrasp: 3D shape estimation of transparent objects for manipulation. InICRA, pages 3634–3642,

  20. [20]

    EasyLabel: A semi-automatic pixel-wise object annotation tool for creating robotic RGB-D datasets

    [Suchiet al., 2019 ] Markus Suchi, Timothy Patten, David Fischinger, and Markus Vincze. EasyLabel: A semi-automatic pixel-wise object annotation tool for creating robotic RGB-D datasets. InICRA, pages 6678–6684,

  21. [21]

    PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes

    [Xianget al., 2018 ] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In RSS,

  22. [22]

    EfficientSAM: Leveraged masked image pretraining for efficient segment anything

    [Xionget al., 2024 ] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoor- thi, and Vikas Chandra. EfficientSAM: Leveraged masked image pretraining for efficient segment anything. InCVPR,

  23. [23]

    [Zenget al., 2018 ] Andy Zeng, Shuran Song, Kuan-Ting Yu, El- liott Donlon, Francois R. Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, and Alberto Rodriguez. Robotic pick-and-place of ...

  24. [24]

    Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

    [Zhanget al., 2023a ] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight SAM for mobile applications.arXiv:2306.14289,

  25. [25]

    Fast segment anything,

    [Zhaoet al., 2023 ] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything.arXiv:2306.12156,

  26. [26]

    Ga- Lore: Memory-efficient LLM training by gradient low-rank pro- jection

    [Zhaoet al., 2024 ] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Ga- Lore: Memory-efficient LLM training by gradient low-rank pro- jection. InICML, 2024