PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion
Pith reviewed 2026-05-16 14:40 UTC · model grok-4.3
The pith
Adapting the SAM model with spatio-modal fusion and dual-view attention achieves state-of-the-art semantic segmentation on panoramic images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PanoSAMic integrates a modified pre-trained Segment Anything Model (SAM) encoder into a semantic segmentation pipeline for panoramic images. The encoder is modified to output multi-stage features, a spatio-modal fusion module selects the relevant modalities and the best features from each for different areas of the input, and a semantic decoder with spherical attention and dual-view fusion addresses the distortions and edge discontinuities of spherical images. The result is state-of-the-art performance on Stanford2D3DS across RGB, RGB-D, and RGB-D-N modalities and on Matterport3D for RGB and RGB-D.
What carries the argument
The spatio-modal fusion module, which selects the relevant modalities and the best features from each for different areas of the panoramic input, working together with the modified SAM encoder and the dual-view spherical-attention decoder.
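The review does not reproduce the module's internals, so the following is only a minimal sketch of what a spatio-modal fusion step could look like under the description above: per-modality, per-stage feature maps are combined with learned, pixel-wise weights so that different regions can favour different modalities and stages. The class name, tensor shapes, and gating design are assumptions for illustration, not the PanoSAMic implementation.

```python
# Hypothetical sketch of a spatio-modal fusion step, assuming per-modality,
# per-stage feature maps of identical shape. Names, shapes, and the gating
# design are illustrative assumptions, not the PanoSAMic implementation.
import torch
import torch.nn as nn


class SpatioModalFusionSketch(nn.Module):
    def __init__(self, channels: int, num_modalities: int, num_stages: int):
        super().__init__()
        self.num_sources = num_modalities * num_stages
        # Predict one weight map per (modality, stage) source from the
        # concatenated features, so the selection can vary per pixel.
        self.gate = nn.Conv2d(self.num_sources * channels, self.num_sources, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, num_sources, C, H, W), one entry per modality/stage pair.
        b, s, c, h, w = feats.shape
        stacked = feats.reshape(b, s * c, h, w)
        weights = torch.softmax(self.gate(stacked), dim=1)       # (B, S, H, W)
        fused = (weights.unsqueeze(2) * feats).sum(dim=1)        # (B, C, H, W)
        return fused


if __name__ == "__main__":
    fusion = SpatioModalFusionSketch(channels=64, num_modalities=3, num_stages=4)
    x = torch.randn(2, 3 * 4, 64, 32, 64)  # e.g. RGB-D-N features at 4 stages
    print(fusion(x).shape)  # torch.Size([2, 64, 32, 64])
```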
Load-bearing premise
The assumption that the pre-trained SAM features remain useful after modification and that the new fusion modules consistently compensate for the distortions introduced by panoramic projection.
What would settle it
Training and testing the model on a new panoramic dataset with different characteristics, such as indoor scenes not covered by Stanford2D3DS or Matterport3D, and checking whether performance drops below that of current leading methods.
Original abstract
Existing image foundation models are not optimized for spherical images, having been trained primarily on perspective images. PanoSAMic integrates the pre-trained Segment Anything (SAM) encoder into a semantic segmentation model for panoramic images using multiple modalities, making use of its extensive training. We modify the SAM encoder to output multi-stage features and introduce a novel spatio-modal fusion module that allows the model to select the relevant modalities and best features from each modality for different areas of the input. Furthermore, our semantic decoder uses spherical attention and dual view fusion to overcome the distortions and edge discontinuity often associated with panoramic images. PanoSAMic achieves state-of-the-art (SotA) results on Stanford2D3DS for RGB, RGB-D, and RGB-D-N modalities and on Matterport3D for RGB and RGB-D modalities. https://github.com/dfki-av/PanoSAMic
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PanoSAMic, which adapts the pre-trained SAM encoder by modifying it to emit multi-stage features, adds a spatio-modal fusion module to select relevant modalities and features across RGB/RGB-D/RGB-D-N inputs, and employs a semantic decoder with spherical attention plus dual-view fusion to mitigate equirectangular distortions and edge discontinuities. It claims state-of-the-art semantic segmentation results on Stanford2D3DS (RGB, RGB-D, RGB-D-N) and Matterport3D (RGB, RGB-D).
Significance. If the SOTA claims are substantiated with full experimental protocols, ablations, and diagnostics, the work would meaningfully extend foundation-model reuse to panoramic multi-modal segmentation, providing a concrete architecture for handling spherical distortions without retraining the entire encoder from scratch.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim on Stanford2D3DS and Matterport3D is load-bearing yet unsupported by any reported experimental protocol, baseline details, ablation results, or error statistics; without these, it is impossible to verify whether the spatio-modal fusion and dual-view modules deliver the claimed compensation for panoramic distortions rather than dataset-specific fitting.
- [§3] §3 (Method): the central assumption that modified SAM multi-stage features preserve pre-trained generality on equirectangular inputs is not directly tested; no diagnostic such as cosine similarity between frozen vs. modified encoder features on distortion-augmented patches or per-boundary mIoU breakdown is provided to confirm the modules mitigate edge discontinuities.
minor comments (2)
- [Abstract] The GitHub link is given but should include a requirements file and exact training scripts to support reproducibility claims.
- [§3.2] Notation for the spatio-modal fusion weights should be defined explicitly with a small equation or diagram in §3.2 (one possible formulation is sketched below for illustration).
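For illustration only, one way the requested notation could be written (assumed, not taken from the paper): per-pixel weights over modality/stage pairs, normalised with a softmax and used to mix the corresponding feature maps.

```latex
% Assumed notation for spatio-modal fusion weights, not the paper's own.
% F_m^{(s)}(p): feature of modality m at stage s and pixel p;
% g_{m,s} is a learned scoring function over the concatenated features F(p);
% the weights are normalised per pixel.
\begin{align}
  \alpha_{m,s}(p) &= \frac{\exp\!\big(g_{m,s}(F(p))\big)}
                         {\sum_{m'}\sum_{s'} \exp\!\big(g_{m',s'}(F(p))\big)}, \\
  F_{\text{fused}}(p) &= \sum_{m}\sum_{s} \alpha_{m,s}(p)\, F_m^{(s)}(p).
\end{align}
```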
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important areas where additional experimental transparency and diagnostics can strengthen the manuscript. We address each point below and will incorporate the requested clarifications and analyses in the revised version.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Experiments): the SOTA claim on Stanford2D3DS and Matterport3D is load-bearing yet unsupported by any reported experimental protocol, baseline details, ablation results, or error statistics; without these, it is impossible to verify whether the spatio-modal fusion and dual-view modules deliver the claimed compensation for panoramic distortions rather than dataset-specific fitting.
Authors: We agree that the current presentation of results requires more detailed protocols to substantiate the SOTA claims. In the revision we will expand §4 to include: (1) full training and evaluation protocols (optimizer, learning rate schedule, data augmentation, cross-validation splits); (2) precise descriptions of all baselines with their original citations and our re-implementation details; (3) complete ablation tables isolating the contribution of the spatio-modal fusion module, spherical attention, and dual-view fusion; and (4) error statistics including per-class mIoU, boundary-specific IoU, and standard deviation across multiple runs. These additions will allow readers to confirm that performance gains arise from the proposed modules rather than dataset-specific fitting. revision: yes
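As an illustration of the kind of error statistics promised above, per-class IoU and mIoU can be computed from a confusion matrix as in the sketch below; this is generic evaluation code, not the authors' released scripts, and the class count is a placeholder.

```python
# Illustrative sketch of per-class IoU / mIoU from a confusion matrix;
# generic evaluation code, not the authors' released scripts.
import numpy as np


def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    # Assumes predictions are valid class indices; ignores out-of-range labels in gt.
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def per_class_iou(cm: np.ndarray) -> np.ndarray:
    tp = np.diag(cm).astype(float)
    denom = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.divide(tp, denom, out=np.full_like(tp, np.nan), where=denom > 0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 13, size=(512, 1024))    # 13 classes as a placeholder
    pred = rng.integers(0, 13, size=(512, 1024))
    ious = per_class_iou(confusion_matrix(pred, gt, num_classes=13))
    print("per-class IoU:", np.round(ious, 3))
    print("mIoU:", np.nanmean(ious))
```

Running this per training seed and reporting the mean and standard deviation of the resulting mIoU values would give the cross-run statistics mentioned in the response.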
Referee: [§3] §3 (Method): the central assumption that modified SAM multi-stage features preserve pre-trained generality on equirectangular inputs is not directly tested; no diagnostic such as cosine similarity between frozen vs. modified encoder features on distortion-augmented patches or per-boundary mIoU breakdown is provided to confirm the modules mitigate edge discontinuities.
Authors: We acknowledge that direct feature-level diagnostics would provide stronger evidence for the preservation of SAM’s generality. In the revised manuscript we will add, either in §3 or a dedicated diagnostics subsection of §4: (1) cosine similarity and feature-map correlation analyses between the frozen original SAM encoder and our multi-stage modified encoder on both perspective and equirectangular inputs (including distortion-augmented patches); (2) per-boundary mIoU breakdowns that isolate performance near the left-right seam and at high-distortion polar regions; and (3) qualitative feature visualizations. These diagnostics will directly test whether the multi-stage adaptation and subsequent modules mitigate edge discontinuities while retaining useful pre-trained representations. revision: yes
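A minimal sketch of the feature-preservation diagnostic described above, assuming both encoders expose a simple callable interface returning (B, C, H, W) feature maps; the interface and the resizing choice are assumptions, not the PanoSAMic code.

```python
# Illustrative sketch of the proposed diagnostic: cosine similarity between
# features of a frozen reference encoder and the modified encoder on the same
# (possibly distortion-augmented) input. Encoder interfaces are assumptions.
import torch
import torch.nn.functional as F


@torch.no_grad()
def feature_cosine_similarity(frozen_encoder, modified_encoder, images: torch.Tensor) -> float:
    """Mean per-location cosine similarity between two (B, C, H, W) feature maps."""
    f_ref = frozen_encoder(images)      # (B, C, H, W)
    f_mod = modified_encoder(images)    # (B, C, H, W); resize if resolutions differ
    if f_mod.shape[-2:] != f_ref.shape[-2:]:
        f_mod = F.interpolate(f_mod, size=f_ref.shape[-2:], mode="bilinear", align_corners=False)
    sim = F.cosine_similarity(f_ref, f_mod, dim=1)  # (B, H, W)
    return sim.mean().item()
```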
Circularity Check
No significant circularity; the architecture extends an externally pre-trained SAM encoder and is evaluated on public benchmarks.
Full rationale
The paper describes an engineering architecture that modifies the pre-trained SAM encoder to emit multi-stage features, adds a spatio-modal fusion module, and uses spherical attention plus dual-view fusion in the decoder. No equations, derivations, or self-referential steps are present that reduce any claimed prediction or result to quantities fitted or defined by the authors themselves. All core components rely on an independent external foundation model (SAM) and evaluation occurs on standard public datasets (Stanford2D3DS, Matterport3D) with conventional metrics. This constitutes a self-contained empirical contribution against external benchmarks rather than any closed derivation loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- spatio-modal fusion weights
axioms (1)
- domain assumption: SAM encoder features remain useful after multi-stage extraction for panoramic inputs
invented entities (2)
- spatio-modal fusion module (no independent evidence)
- dual view fusion (no independent evidence)