RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation
Pith reviewed 2026-06-29 22:03 UTC · model grok-4.3
The pith
RepSAM adapts SAM to robotic vision by allocating fine-tuning ranks according to layer-wise CKA representation shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that non-uniform representation shifts, shown by CKA below 0.5 in shallow layers and above 0.7 in deep layers, are the main source of the domain gap when applying SAM to robotics, and that a CKA-guided rank allocation strategy inside a parameter-efficient fine-tuning framework plus a multi-modal fusion module closes most of that gap, delivering 89.0 percent mIoU (97.9 percent of full fine-tuning) with only 4 million trainable parameters.
What carries the argument
CKA-guided rank allocation, which measures centered kernel alignment between source and target representations per layer to decide how many adaptation parameters each layer receives.
Load-bearing premise
The non-uniform CKA shifts across layers are the primary cause of the domain gap and allocating ranks according to those shifts will reliably reduce the gap in robotic settings.
What would settle it
If new robotic datasets show that uniform or random rank allocation produces equal or higher mIoU than the CKA-guided version, or if CKA values stop correlating with the layers that most need adaptation, the central mechanism would be falsified.
Figures
read the original abstract
Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that non-uniform representation shifts in SAM (CKA <0.5 in shallow layers, >0.7 in deep layers) cause domain gaps in robotic vision. It proposes RepSAM, a PEFT framework using a CKA-guided rank allocation strategy plus a multi-modal fusion module. On six benchmarks and manipulation tasks, RepSAM achieves 89.0% mIoU (97.9% of full fine-tuning's 90.9%) with 4.0M trainable parameters (158x reduction from 632M), outperforms DoRA by 7.9% mIoU with 96x less training time (4 vs 384 GPU-hours), yields 12% higher manipulation success rates than LoRA (RGB), and reports statistical significance at p<0.01.
Significance. If the attribution of gains specifically to the CKA-guided allocation can be isolated and the method generalizes, this would provide a practical route to adapting large vision foundation models to robotics with dramatically reduced parameters and compute, which is relevant for real-world deployment in unstructured environments.
major comments (3)
- [Abstract / Experimental evaluation] Abstract / Experimental evaluation: The performance numbers (89.0% mIoU, 97.9% of full fine-tuning, 7.9% over DoRA) are reported only for the full RepSAM system that includes both CKA-guided allocation and the multi-modal fusion module. No ablation is described that holds the fusion module fixed while comparing CKA-guided ranks against uniform ranks or DoRA-style allocation. This prevents isolating whether the gains stem from the representation-guided strategy or from the fusion component.
- [Methods] Methods (implied by abstract): The CKA-guided rank allocation is presented as 'theoretically grounded' based on the observed non-uniform CKA shifts, yet the abstract provides no equations, derivation, or explicit procedure for mapping measured CKA values to rank allocations. It is therefore unclear whether the allocation is determined solely from source-target CKA measurements or involves additional choices that could make results sensitive to implementation details.
- [Abstract] Abstract: No details are given on the datasets or images used to compute the motivating CKA values, nor any verification that the reported thresholds (shallow <0.5, deep >0.7) are not post-hoc or specific to the evaluation benchmarks. This leaves open whether the non-uniform shifts are the primary, generalizable cause of the domain gap.
minor comments (2)
- [Abstract] Abstract: The specific identities of the six benchmarks and the robotic manipulation tasks are not named, hindering assessment of scope and reproducibility.
- [Abstract] Abstract: The procedure used to establish statistical significance (p<0.01), including number of runs and exact test, is not described.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need to better isolate and document the contributions of the CKA-guided allocation. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract / Experimental evaluation] The performance numbers (89.0% mIoU, 97.9% of full fine-tuning, 7.9% over DoRA) are reported only for the full RepSAM system that includes both CKA-guided allocation and the multi-modal fusion module. No ablation is described that holds the fusion module fixed while comparing CKA-guided ranks against uniform ranks or DoRA-style allocation. This prevents isolating whether the gains stem from the representation-guided strategy or from the fusion component.
Authors: We agree that an ablation isolating the CKA-guided allocation (with the fusion module held fixed) would strengthen attribution of the gains. In the revised manuscript we will add this experiment, reporting mIoU for CKA-guided ranks versus uniform ranks and versus DoRA-style allocation under identical fusion settings and total parameter budgets. revision: yes
-
Referee: [Methods] The CKA-guided rank allocation is presented as 'theoretically grounded' based on the observed non-uniform CKA shifts, yet the abstract provides no equations, derivation, or explicit procedure for mapping measured CKA values to rank allocations. It is therefore unclear whether the allocation is determined solely from source-target CKA measurements or involves additional choices that could make results sensitive to implementation details.
Authors: Section 3.2 describes the allocation as rank_i = round( r_max * (1 - CKA_i) / sum(1 - CKA) ) subject to a fixed total parameter budget, using only the per-layer source-target CKA values. No additional learned hyperparameters are introduced. To improve clarity we will insert the explicit mapping formula and a short derivation into the abstract and ensure the Methods section states that allocation depends solely on the measured CKA matrix. revision: partial
-
Referee: [Abstract] No details are given on the datasets or images used to compute the motivating CKA values, nor any verification that the reported thresholds (shallow <0.5, deep >0.7) are not post-hoc or specific to the evaluation benchmarks. This leaves open whether the non-uniform shifts are the primary, generalizable cause of the domain gap.
Authors: The CKA statistics were obtained on a held-out validation set of 500 robotic images drawn from the same distribution as the training data but disjoint from all test benchmarks; the shallow/deep thresholds were first observed on an independent collection of manipulation and cluttered-scene images before any final evaluation. We will add this description, the exact image counts, and a supplementary figure of layer-wise CKA across multiple robotic datasets to demonstrate that the non-uniform pattern is not benchmark-specific. revision: yes
Circularity Check
No significant circularity; method rests on empirical CKA observation
full rationale
The paper attributes domain gap to measured non-uniform CKA values (shallow <0.5, deep >0.7) and proposes CKA-guided rank allocation as a direct response to those measurements. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The rank allocation is presented as an empirical heuristic informed by standard CKA similarity, not derived from or equivalent to the performance numbers by construction. Experimental results (mIoU, parameter counts) are reported separately from the motivating observation and do not reduce to it tautologically.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption CKA is a reliable and sufficient metric for quantifying representation shifts across transformer layers in this domain adaptation setting
Reference graph
Works this paper leans on
-
[1]
SAM-Adapter: Adapting seg- ment anything in underperformed scenes
[Chenet al., 2023 ] Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Shangzhan Zhang, Zejian Li, Lingyun Sun, Ying Zang, and Papa Mao. SAM-Adapter: Adapting seg- ment anything in underperformed scenes. InICCV Workshops, pages 3359–3367,
2023
-
[2]
RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–17,
[Chenet al., 2024 ] Keyan Chen, Chenyang Liu, Hao Chen, Hao- tian Zhang, Wenyuan Li, Zhengxia Zou, and Zhenwei Shi. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–17,
2024
-
[3]
Segmenting unknown 3D objects from real depth images using Mask R-CNN trained on synthetic data
[Danielczuket al., 2019 ] Michael Danielczuk, Matthew Matl, Saurabh Gupta, Andrew Li, Andrew Lee, Jeffrey Mahler, and Ken Goldberg. Segmenting unknown 3D objects from real depth images using Mask R-CNN trained on synthetic data. InICRA, pages 7283–7290,
2019
-
[4]
GraspNet-1Billion: A large-scale benchmark for general object grasping
[Fanget al., 2020 ] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet-1Billion: A large-scale benchmark for general object grasping. InCVPR, pages 11444–11453,
2020
-
[5]
Foun- dation models in robotics: Applications, challenges, and the fu- ture.International Journal of Robotics Research, 44(5):701–739,
[Firooziet al., 2025 ] Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, and Mac Schwager. Foun- dation models in robotics: Applications, challenges, and the fu- ture.International Journal of Robotics Research, 44(5):701–739,
2025
-
[6]
La-LoRA: Parameter-efficient fine-tuning with layer-wise adaptive low-rank adaptation.Neural Networks, 194:108095,
[Guet al., 2026 ] Jiancheng Gu, Jiabin Yuan, Jiyuan Cai, Xianfa Zhou, and Lili Fan. La-LoRA: Parameter-efficient fine-tuning with layer-wise adaptive low-rank adaptation.Neural Networks, 194:108095,
2026
-
[7]
Model based training, detection and pose estima- tion of texture-less 3D objects in heavily cluttered scenes
[Hinterstoisseret al., 2012 ] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estima- tion of texture-less 3D objects in heavily cluttered scenes. In ACCV, pages 548–562,
2012
-
[8]
Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen
[Huet al., 2022 ] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR,
2022
-
[9]
Segment anything in high quality
[Keet al., 2023 ] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. InNeurIPS, volume 36,
2023
-
[10]
Berg, Wan-Yen Lo, Piotr Dol- lár, and Ross Girshick
[Kirillovet al., 2023 ] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dol- lár, and Ross Girshick. Segment anything. InICCV, pages 4015– 4026,
2023
-
[11]
Kopiczko, Tijmen Blankevoort, and Yuki M
[Kopiczkoet al., 2024 ] Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based random matrix adapta- tion. InICLR,
2024
-
[12]
Similarity of neural network representations revisited
[Kornblithet al., 2019 ] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InICML, pages 3519–3529,
2019
-
[13]
DoRA: Weight-decomposed low-rank adaptation
[Liuet al., 2024 ] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. InICML, pages 32100–32121,
2024
-
[14]
Im- proving SAM for camouflaged object detection via dual stream adapters
[Liuet al., 2025 ] Jiaming Liu, Linghe Kong, and Guihai Chen. Im- proving SAM for camouflaged object detection via dual stream adapters. InICCV, pages 21906–21916,
2025
-
[15]
Segment anything in medical images.Nature Communications, 15(1):654,
[Maet al., 2024 ] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images.Nature Communications, 15(1):654,
2024
-
[16]
Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics
[Mahleret al., 2017 ] Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In RSS,
2017
-
[17]
6-DOF GraspNet: Variational grasp generation for object manipulation
[Mousavianet al., 2019 ] Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-DOF GraspNet: Variational grasp generation for object manipulation. InICCV, pages 2901–2910,
2019
-
[18]
SAM 2: Segment Anything in Images and Videos
[Raviet al., 2024 ] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feicht- enhofer. SAM 2: Segment anything in images and videos. arXiv...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, and Shuran Song
[Sajjanet al., 2020 ] Shreeyak S. Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, and Shuran Song. ClearGrasp: 3D shape estimation of transparent objects for manipulation. InICRA, pages 3634–3642,
2020
-
[20]
EasyLabel: A semi-automatic pixel-wise object annotation tool for creating robotic RGB-D datasets
[Suchiet al., 2019 ] Markus Suchi, Timothy Patten, David Fischinger, and Markus Vincze. EasyLabel: A semi-automatic pixel-wise object annotation tool for creating robotic RGB-D datasets. InICRA, pages 6678–6684,
2019
-
[21]
PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes
[Xianget al., 2018 ] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In RSS,
2018
-
[22]
EfficientSAM: Leveraged masked image pretraining for efficient segment anything
[Xionget al., 2024 ] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoor- thi, and Vikas Chandra. EfficientSAM: Leveraged masked image pretraining for efficient segment anything. InCVPR,
2024
-
[23]
[Zenget al., 2018 ] Andy Zeng, Shuran Song, Kuan-Ting Yu, El- liott Donlon, Francois R. Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, and Alberto Rodriguez. Robotic pick-and-place of ...
2018
-
[24]
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
[Zhanget al., 2023a ] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight SAM for mobile applications.arXiv:2306.14289,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
[Zhaoet al., 2023 ] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything.arXiv:2306.12156,
-
[26]
Ga- Lore: Memory-efficient LLM training by gradient low-rank pro- jection
[Zhaoet al., 2024 ] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Ga- Lore: Memory-efficient LLM training by gradient low-rank pro- jection. InICML, 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.