pith. machine review for the scientific record.

arxiv: 2601.07447 · v3 · submitted 2026-01-12 · 💻 cs.CV

Recognition: no theorem link

PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion

Authors on Pith no claims yet

Pith reviewed 2026-05-16 14:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords panoramic image segmentation · SAM encoder · spatio-modal fusion · dual view fusion · spherical attention · semantic segmentation · RGB-D modalities · Stanford2D3DS

The pith

Adapting the SAM model with spatio-modal fusion and dual-view attention achieves state-of-the-art semantic segmentation on panoramic images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt a foundation model trained on ordinary photos to work on 360-degree panoramic images for semantic segmentation. It modifies the SAM encoder to produce features at multiple stages and adds a fusion module that picks the best information from different input types like color, depth, and normals. A decoder using spherical attention and dual views then stitches everything together to handle the warping and seams typical in panoramas. If this works, it means existing large models can be repurposed for immersive and robotic vision tasks without starting from scratch. This matters because panoramic data is increasingly used in virtual reality, autonomous navigation, and scene understanding where standard models fail due to distortion.
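To make the encoder adaptation concrete, the following minimal sketch shows one way to expose multi-stage features from a ViT-style encoder such as SAM's, by registering forward hooks on selected transformer blocks. The class name, the hook mechanism, and the choice of tapped blocks are illustrative assumptions, not the paper's actual modification.

from typing import Dict, List

import torch
import torch.nn as nn


class MultiStageTap(nn.Module):
    """Wraps an encoder and records the outputs of selected blocks."""

    def __init__(self, encoder: nn.Module, tap_blocks: List[nn.Module]):
        super().__init__()
        self.encoder = encoder
        self._features: Dict[int, torch.Tensor] = {}
        for i, block in enumerate(tap_blocks):
            block.register_forward_hook(self._make_hook(i))

    def _make_hook(self, idx: int):
        def hook(_module, _inputs, output):
            # Store this block's output; the decoder reads these per-stage features.
            self._features[idx] = output
        return hook

    def forward(self, x: torch.Tensor) -> Dict[int, torch.Tensor]:
        self._features.clear()
        self.encoder(x)              # single pass through the (possibly frozen) encoder
        return dict(self._features)  # stage index -> feature tensor

In a setup like this, the decoder receives one feature map per tapped stage, analogous to how the paper's decoder consumes multi-stage SAM features.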

Core claim

PanoSAMic integrates a modified pre-trained Segment Anything Model encoder into a semantic segmentation pipeline for panoramic images. By outputting multi-stage features and employing a spatio-modal fusion module to select relevant modalities and features, combined with a semantic decoder that uses spherical attention and dual view fusion, the model overcomes distortions and edge discontinuities in spherical images. This results in state-of-the-art performance on the Stanford2D3DS dataset across RGB, RGB-D, and RGB-D-N modalities and on Matterport3D for RGB and RGB-D.

What carries the argument

The spatio-modal fusion module, which selects relevant modalities and best features from each for different areas of the panoramic input, working together with the modified SAM encoder and the dual-view spherical attention decoder.
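A minimal sketch of what such per-location modality selection could look like, assuming spatially aligned feature maps for each modality (e.g. RGB, depth, normals): a 1x1 convolution predicts a softmax weight per modality at every spatial location, and the fused feature is the weighted sum. The gating design and all names are assumptions for illustration; the paper's spatio-modal fusion module may be organized differently.

from typing import List

import torch
import torch.nn as nn


class SpatioModalGate(nn.Module):
    """Softmax-weighted blend of per-modality features at every spatial location."""

    def __init__(self, channels: int, num_modalities: int):
        super().__init__()
        # One gating logit per modality, predicted from the concatenated features.
        self.gate = nn.Conv2d(channels * num_modalities, num_modalities, kernel_size=1)

    def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
        # feats: list of (B, C, H, W) tensors, one per modality, spatially aligned.
        stacked = torch.stack(feats, dim=1)                   # (B, M, C, H, W)
        weights = self.gate(torch.cat(feats, dim=1))          # (B, M, H, W)
        weights = torch.softmax(weights, dim=1).unsqueeze(2)  # (B, M, 1, H, W)
        return (weights * stacked).sum(dim=1)                 # (B, C, H, W)


# Example: fuse RGB, depth, and normal features of matching shape.
rgb, depth, normals = (torch.randn(1, 64, 32, 64) for _ in range(3))
fused = SpatioModalGate(channels=64, num_modalities=3)([rgb, depth, normals])

The same gating idea extends from selecting among modalities to selecting among multi-stage encoder features by stacking stage outputs in place of modalities.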

Load-bearing premise

The assumption that the pre-trained SAM features stay useful after modification and that the new fusion modules can consistently fix the distortions introduced by panoramic projections.

What would settle it

Training and testing the model on a new panoramic dataset with different characteristics, such as indoor scene types not covered in Stanford2D3DS or Matterport3D, and checking whether performance drops below the current leading methods.
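Such a cross-dataset check would rest on the standard per-class IoU and mean IoU comparison. A minimal sketch of that computation from predicted and ground-truth label maps, using the usual confusion-matrix formulation (not code from the paper):

import numpy as np


def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore_index: int = 255):
    # pred, gt: integer label maps of identical shape.
    mask = gt != ignore_index
    conf = np.bincount(
        gt[mask].astype(int) * num_classes + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)          # rows: ground truth, cols: prediction
    inter = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)
    return np.nanmean(iou), iou                  # mean IoU over valid classes, per-class IoU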

Figures

Figures reproduced from arXiv: 2601.07447 by Didier Stricker, Jason Rambach, Mahdi Chamseddine.

Figure 1. SAM [13] is not trained for semantic segmentation and is unable to fully …
Figure 2. PanoSAMic architecture. Two views of the same panoramic input are …
Figure 3. Objects in panoramic images that are disconnected on the edges are …
Figure 4. Our novel blocks used for feature fusion and dual view fusion.
Figure 5. Comparison of the qualitative segmentation results of our PanoSAMic …
Figure 6. Some examples of scenes with imprecise ground truth labels from Stanford2D3DS …
Figure 7. Comparison of the qualitative segmentation results before and after refinement …
Figure 8. More qualitative results on the Stanford2D3DS dataset [1].
read the original abstract

Existing image foundation models are not optimized for spherical images, having been trained primarily on perspective images. PanoSAMic integrates the pre-trained Segment Anything (SAM) encoder to make use of its extensive training and integrate it into a semantic segmentation model for panoramic images using multiple modalities. We modify the SAM encoder to output multi-stage features and introduce a novel spatio-modal fusion module that allows the model to select the relevant modalities and best features from each modality for different areas of the input. Furthermore, our semantic decoder uses spherical attention and dual view fusion to overcome the distortions and edge discontinuity often associated with panoramic images. PanoSAMic achieves state-of-the-art (SotA) results on Stanford2D3DS for RGB, RGB-D, and RGB-D-N modalities and on Matterport3D for RGB and RGB-D modalities. https://github.com/dfki-av/PanoSAMic

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PanoSAMic, which adapts the pre-trained SAM encoder by modifying it to emit multi-stage features, adds a spatio-modal fusion module to select relevant modalities and features across RGB/RGB-D/RGB-D-N inputs, and employs a semantic decoder with spherical attention plus dual-view fusion to mitigate equirectangular distortions and edge discontinuities. It claims state-of-the-art semantic segmentation results on Stanford2D3DS (RGB, RGB-D, RGB-D-N) and Matterport3D (RGB, RGB-D).

Significance. If the SOTA claims are substantiated with full experimental protocols, ablations, and diagnostics, the work would meaningfully extend foundation-model reuse to panoramic multi-modal segmentation, providing a concrete architecture for handling spherical distortions without retraining the entire encoder from scratch.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim on Stanford2D3DS and Matterport3D is load-bearing yet unsupported by any reported experimental protocol, baseline details, ablation results, or error statistics; without these, it is impossible to verify whether the spatio-modal fusion and dual-view modules deliver the claimed compensation for panoramic distortions rather than dataset-specific fitting.
  2. [§3] §3 (Method): the central assumption that modified SAM multi-stage features preserve pre-trained generality on equirectangular inputs is not directly tested; no diagnostic such as cosine similarity between frozen vs. modified encoder features on distortion-augmented patches or per-boundary mIoU breakdown is provided to confirm the modules mitigate edge discontinuities.
minor comments (2)
  1. [Abstract] The GitHub link is given, but the repository should include a requirements file and the exact training scripts to support the reproducibility claims.
  2. [§3.2] Notation for the spatio-modal fusion weights should be defined explicitly with a small equation or diagram in §3.2.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas where additional experimental transparency and diagnostics can strengthen the manuscript. We address each point below and will incorporate the requested clarifications and analyses in the revised version.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim on Stanford2D3DS and Matterport3D is load-bearing yet unsupported by any reported experimental protocol, baseline details, ablation results, or error statistics; without these, it is impossible to verify whether the spatio-modal fusion and dual-view modules deliver the claimed compensation for panoramic distortions rather than dataset-specific fitting.

    Authors: We agree that the current presentation of results requires more detailed protocols to substantiate the SOTA claims. In the revision we will expand §4 to include: (1) full training and evaluation protocols (optimizer, learning rate schedule, data augmentation, cross-validation splits); (2) precise descriptions of all baselines with their original citations and our re-implementation details; (3) complete ablation tables isolating the contribution of the spatio-modal fusion module, spherical attention, and dual-view fusion; and (4) error statistics including per-class mIoU, boundary-specific IoU, and standard deviation across multiple runs. These additions will allow readers to confirm that performance gains arise from the proposed modules rather than dataset-specific fitting. revision: yes

  2. Referee: [§3] §3 (Method): the central assumption that modified SAM multi-stage features preserve pre-trained generality on equirectangular inputs is not directly tested; no diagnostic such as cosine similarity between frozen vs. modified encoder features on distortion-augmented patches or per-boundary mIoU breakdown is provided to confirm the modules mitigate edge discontinuities.

    Authors: We acknowledge that direct feature-level diagnostics would provide stronger evidence for the preservation of SAM’s generality. In the revised manuscript we will add, either in §3 or a dedicated diagnostics subsection of §4: (1) cosine similarity and feature-map correlation analyses between the frozen original SAM encoder and our multi-stage modified encoder on both perspective and equirectangular inputs (including distortion-augmented patches); (2) per-boundary mIoU breakdowns that isolate performance near the left-right seam and at high-distortion polar regions; and (3) qualitative feature visualizations. These diagnostics will directly test whether the multi-stage adaptation and subsequent modules mitigate edge discontinuities while retaining useful pre-trained representations. revision: yes
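As a sketch of the feature-preservation diagnostic proposed in the response above, one could compare frozen and adapted encoder features with per-location cosine similarity, split into a seam band and the interior of the equirectangular feature grid. The function names and the seam-band heuristic are assumptions for illustration, not the authors' planned protocol.

import torch
import torch.nn.functional as F


def mean_cosine_similarity(frozen: torch.Tensor, adapted: torch.Tensor) -> float:
    # frozen, adapted: (B, C, H, W) feature maps from the two encoders.
    sim = F.cosine_similarity(frozen, adapted, dim=1)  # per-location similarity, (B, H, W)
    return sim.mean().item()


def seam_vs_interior(frozen: torch.Tensor, adapted: torch.Tensor, seam_frac: float = 0.05):
    # Treat the leftmost/rightmost columns of the feature map as the seam band.
    w = frozen.shape[-1]
    k = max(1, int(w * seam_frac))
    seam_f = torch.cat([frozen[..., :k], frozen[..., -k:]], dim=-1)
    seam_a = torch.cat([adapted[..., :k], adapted[..., -k:]], dim=-1)
    return (
        mean_cosine_similarity(seam_f, seam_a),                         # near the left-right seam
        mean_cosine_similarity(frozen[..., k:-k], adapted[..., k:-k]),  # interior
    )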

Circularity Check

0 steps flagged

No significant circularity; architecture extends external pre-trained SAM on public benchmarks

full rationale

The paper describes an engineering architecture that modifies the pre-trained SAM encoder to emit multi-stage features, adds a spatio-modal fusion module, and uses spherical attention plus dual-view fusion in the decoder. No equations, derivations, or self-referential steps are present that reduce any claimed prediction or result to quantities fitted or defined by the authors themselves. All core components rely on an independent external foundation model (SAM) and evaluation occurs on standard public datasets (Stanford2D3DS, Matterport3D) with conventional metrics. This constitutes a self-contained empirical contribution against external benchmarks rather than any closed derivation loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

Central claim rests on transferability of SAM features to panoramic inputs and on the effectiveness of two newly introduced modules whose internal weights are learned from the evaluation datasets.

free parameters (1)
  • spatio-modal fusion weights
    Learned parameters that select modalities and features per region; not enumerated in abstract.
axioms (1)
  • domain assumption SAM encoder features remain useful after multi-stage extraction for panoramic inputs
    Invoked by the decision to integrate the pre-trained encoder without domain-specific pre-training.
invented entities (2)
  • spatio-modal fusion module no independent evidence
    purpose: Select relevant modalities and best features from each modality for different areas
    New component introduced to handle multi-modal panoramic inputs.
  • dual view fusion no independent evidence
    purpose: Overcome edge discontinuity in panoramic images
    Decoder technique proposed to address spherical distortions.

pith-pipeline@v0.9.0 · 5458 in / 1303 out tokens · 44189 ms · 2026-05-16T14:40:01.122836+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1] Armeni, I., Sax, S., Zamir, A.R., Savarese, S.: Joint 2d-3d-semantic data for indoor scene understanding. arXiv:1702.01105 (2017)
  2. [2] Cao, J., Leng, H., Lischinski, D., Cohen-Or, D., Tu, C., Li, Y.: Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In: ICCV (2021)
  3. [3] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., et al.: Sam 3: Segment anything with concepts. arXiv:2511.16719 (2025)
  4. [4] Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., et al.: Matterport3d: Learning from rgb-d data in indoor environments. arXiv:1709.06158 (2017)
  5. [5] Chaplot, D.S., Salakhutdinov, R., Gupta, A., Gupta, S.: Neural topological slam for visual navigation. In: CVPR (2020)
  6. [6] Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In: CVPR (2024)
  7. [7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  8. [8] Eder, M., Shvets, M., Lim, J., Frahm, J.M.: Tangent images for mitigating spherical distortion. In: CVPR (2020)
  9. [9] Guttikonda, S., Rambach, J.: Single frame semantic segmentation using multi-modal spherical images. In: WACV (2024)
  10. [10] Jiang, C.M., Huang, J., Kashinath, K., Prabhat, Marcus, P., Niessner, M.: Spherical CNNs on unstructured grids. In: ICLR (2019)
  11. [11] Kanayama, H., Chamseddine, M., Guttikonda, S., Okumura, S., Yokota, S., Stricker, D., Rambach, J.: ToF-360 - a panoramic time-of-flight rgb-d dataset for single capture indoor semantic 3d reconstruction. In: CVPRW (2025)
  12. [12] Kaufmann, F., Chamseddine, M., Guttikonda, S., Glock, C., Stricker, D., Rambach, J.: Ontology-based semantic labeling for rgb-d and point cloud datasets. In: EC3. vol. 4. European Council on Computing in Construction (2023)
  13. [13] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., et al.: Segment anything. In: ICCV (2023)
  14. [14] Kweon, H., Yoon, K.J.: From sam to cams: Exploring segment anything model for weakly supervised semantic segmentation. In: CVPR (2024)
  15. [15] Li, X., Wu, T., Qi, Z., Wang, G., Shan, Y., Li, X.: Sgat4pass: Spherical geometry-aware transformer for panoramic semantic segmentation. In: IJCAI (2023)
  16. [16] Li, Y., Guo, Y., Yan, Z., Huang, X., Duan, Y., Ren, L.: Omnifusion: 360 monocular depth estimation via geometry-aware fusion. In: CVPR (2022)
  17. [17] Ma, C., Zhang, J., Yang, K., Roitberg, A., Stiefelhagen, R.: Densepass: Dense panoramic semantic segmentation via unsupervised domain adaptation with attention-augmented context exchange. In: ITSC. IEEE (2021)
  18. [18] Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1) (2024)
  19. [19] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., et al.: Learning transferable visual models from natural language supervision. In: ICML. PMLR (2021)
  20. [20] Rahman, M.A., Wang, Y.: Optimizing intersection-over-union in deep neural networks for image segmentation. In: ISVC. Springer (2016)
  21. [21] Shah, U., Tukur, M., Alzubaidi, M., Pintore, G., Gobbetti, E., Househ, M., Schneider, J., et al.: Multipanowise: Holistic deep architecture for multi-task dense prediction from a single panoramic image. In: CVPR (2024)
  22. [22] Shen, Z., Lin, C., Liao, K., Nie, L., Zheng, Z., Zhao, Y.: Panoformer: Panorama transformer for indoor 360° depth estimation. In: ECCV. Springer (2022)
  23. [23] Sun, C., Sun, M., Chen, H.T.: Hohonet: 360 indoor holistic understanding with latent horizontal features. In: CVPR (2021)
  24. [24] Tateno, K., Navab, N., Tombari, F.: Distortion-aware convolutional filters for dense prediction in panoramic images. In: ECCV (2018)
  25. [25] Taubert, O., Götz, M., Schug, A., Streit, A.: Loss scheduling for class-imbalanced image segmentation problems. In: ICMLA. IEEE (2020)
  26. [26] Teng, Z., Zhang, J., Yang, K., Peng, K., Shi, H., Reiß, S., Cao, K., et al.: 360bev: Panoramic semantic mapping for indoor bird's-eye view. In: WACV (2024)
  27. [27] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., et al.: Attention is all you need. NeurIPS 30 (2017)
  28. [28] Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional block attention module. In: ECCV (2018)
  29. [29] Wright, L., Demeure, N.: Ranger21: A synergistic deep learning optimizer. arXiv:2106.13731 (2021)
  30. [30] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. NeurIPS 34 (2021)
  31. [31] Xu, Y., Zhang, Z., Gao, S.: Spherical DNNs and their applications in 360° images and videos. TPAMI 44(10) (2021)
  32. [32] Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: CVPR (2024)
  33. [33] Yang, Y., Wu, X., He, T., Zhao, H., Liu, X.: Sam3d: Segment anything in 3d scenes. arXiv:2306.03908 (2023)
  34. [34] Yao, B., Deng, Y., Liu, Y., Chen, H., Li, Y., Yang, Z.: Sam-event-adapter: Adapting segment anything model for event-rgb semantic segmentation. In: ICRA. IEEE (2024)
  35. [35] Zhang, H., Li, F., Zou, X., Liu, S., Li, C., Yang, J., Zhang, L.: A simple framework for open-vocabulary segmentation and detection. In: ICCV (2023)
  36. [36] Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., Stiefelhagen, R.: Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. T-ITS 24(12) (2023)
  37. [37] Zhang, J., Yang, K., Shi, H., Reiß, S., Peng, K., Ma, C., Fu, H., et al.: Behind every domain there is a shift: Adapting distortion-aware vision transformers for panoramic semantic segmentation. TPAMI (2024)
  38. [38] Zhang, J., Ma, K., Kapse, S., Saltz, J., Vakalopoulou, M., Prasanna, P., Samaras, D.: Sam-path: A segment anything model for semantic segmentation in digital pathology. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2023)
  39. [39] Zheng, J., Liu, R., Chen, Y., Peng, K., Wu, C., Yang, K., Zhang, J., et al.: Open panoramic segmentation. In: ECCV. Springer (2024)
  40. [40] Zheng, Z., Lin, C., Nie, L., Liao, K., Shen, Z., Zhao, Y.: Complementary bi-directional feature compression for indoor 360° semantic segmentation with self-distillation. In: WACV (2023)
  41. [41] Zhou, Y., Thielmann, P., Chamoli, A., Mirbach, B., Stricker, D., Rambach, J.: Particlesam: Small particle segmentation for material quality monitoring in recycling processes. arXiv:2508.03490 (2025)
  42. [42] Zhuang, C., Lu, Z., Wang, Y., Xiao, J., Wang, Y.: Acdnet: Adaptively combined dilated convolution for monocular panorama depth estimation. In: AAAI. vol. 36 (2022)