Amodal SAM: A Unified Amodal Segmentation Framework with Generalization
Pith reviewed 2026-05-10 01:01 UTC · model grok-4.3
The pith
Amodal SAM adds a lightweight adapter and synthetic occlusion data to SAM so the model can predict complete object shapes including hidden parts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Amodal SAM extends SAM for both image and video amodal segmentation by inserting a lightweight Spatial Completion Adapter that reconstructs occluded regions while freezing the original SAM features. Training relies on a Target-Aware Occlusion Synthesis pipeline that generates diverse synthetic occlusions and on learning objectives that enforce regional consistency and topological regularization. This produces state-of-the-art results on standard benchmarks together with strong generalization to novel object categories and unseen contexts.
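The task behind this claim can be stated in mask algebra: the amodal mask is the union of the visible region and the occluded region, so the model's extra job over ordinary segmentation is recovering the set difference. A toy stdlib sketch of that decomposition (not the paper's code):

```python
# Toy illustration of the amodal-segmentation task. Masks are sets of
# pixel indices: the amodal mask is the full object extent, the visible
# (modal) mask is what the camera sees, and the occluded region is
# their set difference, the part the model must reconstruct.

amodal = {0, 1, 2, 3, 4, 5}   # full object extent (ground truth)
visible = {0, 1, 2}           # pixels not hidden by an occluder

occluded = amodal - visible   # region to be reconstructed
assert occluded == {3, 4, 5}
assert visible <= amodal      # the visible region lies inside the amodal one
```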
What carries the argument
The Spatial Completion Adapter is a lightweight module added to a frozen SAM that reconstructs occluded object regions; it is trained on data from the Target-Aware Occlusion Synthesis pipeline with regional-consistency and topological-regularization objectives.
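A common shape for such adapters, and a plausible reading of this one (the paper's exact architecture is not reproduced here), is a small residual bottleneck on top of frozen backbone features, so that zero adapter weights leave the frozen model's output untouched. A stdlib toy sketch:

```python
# Minimal residual bottleneck adapter over a frozen feature vector
# (toy sketch; the paper's Spatial Completion Adapter may differ).
# Data flow: down-project, nonlinearity, up-project, residual add.
# The backbone feature itself is never modified, mimicking frozen SAM.

def relu(x):
    return [max(0.0, v) for v in x]

def matvec(w, x):
    # w: list of matrix rows; x: input vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def adapter(feat, w_down, w_up):
    hidden = relu(matvec(w_down, feat))          # bottleneck: dim 4 -> 2
    delta = matvec(w_up, hidden)                 # back up: dim 2 -> 4
    return [f + d for f, d in zip(feat, delta)]  # residual connection

frozen_feat = [1.0, -2.0, 0.5, 3.0]  # stand-in for a SAM feature
w_down = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1]]
w_up = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

out = adapter(frozen_feat, w_down, w_up)

# A zero-initialized adapter is an identity map: training it cannot
# destroy the frozen model's behavior at step zero.
zero_down = [[0.0] * 4, [0.0] * 4]
assert adapter(frozen_feat, zero_down, w_up) == frozen_feat
```

The residual form is what lets the adapter "extend without overwriting": SAM's features pass through unchanged, and the bottleneck only adds a correction.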
Load-bearing premise
The synthetic occlusions generated by the Target-Aware Occlusion Synthesis pipeline have statistics and difficulty close enough to real-world cases that the adapter generalizes beyond the training distribution.
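This premise is checkable in principle: compute occlusion statistics for synthetic and real annotations and measure the distributional gap. A minimal sketch of the occlusion-ratio comparison (the bin count, distance metric, and all numbers are illustrative choices, not the paper's):

```python
# Sketch of the distributional check the premise requires: compare
# occlusion-ratio histograms of synthetic (TAOS-style) and real
# (KINS/COCOA-style) annotations. All numbers here are toy values;
# occlusion ratio = occluded area / amodal area.

def occlusion_ratio(amodal_area, visible_area):
    return (amodal_area - visible_area) / amodal_area

def histogram(ratios, bins=4):
    counts = [0] * bins
    for r in ratios:
        counts[min(int(r * bins), bins - 1)] += 1
    return [c / len(ratios) for c in counts]

def tv_distance(h1, h2):
    # Total-variation distance between normalized histograms (0 = match).
    return sum(abs(a - b) for a, b in zip(h1, h2)) / 2

synthetic = [(100, 90), (100, 60), (100, 40), (100, 10)]  # (amodal, visible)
real = [(80, 72), (80, 50), (80, 30), (80, 8)]

h_syn = histogram([occlusion_ratio(a, v) for a, v in synthetic])
h_real = histogram([occlusion_ratio(a, v) for a, v in real])
gap = tv_distance(h_syn, h_real)  # a large gap would undercut the premise
assert 0.0 <= gap <= 1.0
```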
What would settle it
A direct comparison on real images of novel object categories with natural occlusions would falsify the generalization claim if Amodal SAM performs no better than unmodified SAM.
original abstract
Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target-Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state-of-the-art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real-world environments.
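The synthesis idea in point (2) of the abstract can be sketched in a few lines: composite an occluder over a fully visible object, so the pre-composite mask becomes a free amodal label. The placement below is deliberately naive; TAOS's target-aware placement logic is the paper's contribution and is not reproduced here:

```python
# Schematic synthetic-occlusion generation: start from a fully visible
# object (its mask is a free amodal label), paste an occluder on top,
# and keep the uncovered part as the visible mask. Masks are flat 0/1
# lists; a target-aware pipeline would instead choose the occluder's
# shape and position based on the target object.

def synthesize(amodal_mask, occluder_mask):
    # Training pair = (visible input, amodal label).
    return [int(a == 1 and o == 0)
            for a, o in zip(amodal_mask, occluder_mask)]

amodal = [1, 1, 1, 1, 0, 0]    # full object
occluder = [0, 0, 1, 1, 1, 0]  # synthetic occluder pasted over it
visible = synthesize(amodal, occluder)
assert visible == [1, 1, 0, 0, 0, 0]
```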
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Amodal SAM, a unified framework that adapts the Segment Anything Model (SAM) for amodal segmentation of both images and videos. It adds a lightweight Spatial Completion Adapter for reconstructing occluded regions, a Target-Aware Occlusion Synthesis (TAOS) pipeline to generate synthetic training data addressing annotation scarcity, and new learning objectives for regional consistency and topological regularization. The central claims are state-of-the-art performance on standard benchmarks together with robust generalization to novel object categories and unseen contexts.
Significance. If the empirical results hold, the work would meaningfully advance amodal segmentation by preserving SAM's strong zero-shot generalization while using synthetic data to overcome the lack of amodal annotations. The unified image-video treatment and focus on practical unconstrained settings are notable strengths that could influence downstream applications in robotics and scene understanding.
major comments (3)
- §4.2 (TAOS pipeline): The description of target-aware occlusion synthesis provides no quantitative comparison of key statistics (occlusion ratio histograms, boundary complexity, number of overlapping instances, or topological features) between TAOS-generated data and real amodal datasets such as KINS or COCOA. Because the generalization results in §5.3–5.4 rest on the assumption that synthetic occlusions sufficiently match real-world distributions, this omission directly threatens the validity of the adapter's learned behavior and the headline generalization claim.
- Table 1 and §5.1: The SOTA performance numbers are reported without error bars, multiple random seeds, or ablations that isolate the contribution of TAOS from the adapter and the new objectives. Given the synthetic nature of the training data, the absence of these controls makes it impossible to determine whether the reported gains are robust or could be artifacts of the particular TAOS hyperparameter choices.
- §3.3 (Learning Objectives): The topological regularization term is motivated, but its concrete effect on failure modes (e.g., thin structures or multiply occluded objects) is not analyzed; an ablation showing how removing this term changes performance on the novel-scenario test sets would be required to substantiate that it contributes to the claimed generalization rather than merely regularizing the synthetic training distribution.
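For orientation, one generic instantiation of the two objective families at issue (not the paper's exact losses): regional consistency can penalize visible pixels falling outside the predicted amodal mask, and topological regularization can penalize predictions that fragment into several connected components:

```python
# Generic sketch of the two objective families named in the paper
# (not its exact losses). Masks are 2-D lists of 0/1.

def consistency_violations(visible, amodal_pred):
    # Regional consistency: every visible pixel should lie inside the
    # predicted amodal mask; count pixels violating this containment.
    return sum(
        1
        for vrow, arow in zip(visible, amodal_pred)
        for v, a in zip(vrow, arow)
        if v == 1 and a == 0
    )

def num_components(mask):
    # Topological term: count 4-connected components via flood fill;
    # a single object should usually form one component.
    h, w = len(mask), len(mask[0])
    seen, comps = set(), 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 1 and (i, j) not in seen:
                comps += 1
                stack = [(i, j)]
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen:
                        continue
                    seen.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1:
                            stack.append((ny, nx))
    return comps

visible = [[1, 1, 0], [0, 0, 0]]
fragmented = [[1, 0, 1], [0, 0, 1]]  # two blobs, misses a visible pixel
assert consistency_violations(visible, fragmented) == 1
assert num_components(fragmented) == 2
```

The requested ablation would then measure how much each penalty, when removed, changes scores on the novel-scenario test sets.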
minor comments (2)
- Figure 3: The caption does not clearly indicate whether the visualized occlusions are real or TAOS-generated, nor does it label the specific failure modes being highlighted.
- Abstract: The abstract states that experiments demonstrate SOTA results and generalization, yet the main text should ensure every quantitative claim is accompanied by the corresponding table or figure reference in the same paragraph.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript's claims regarding data fidelity, experimental robustness, and component contributions.
point-by-point responses
- Referee (major comment 1, §4.2 TAOS pipeline): no quantitative comparison of occlusion statistics between TAOS-generated data and real amodal datasets such as KINS or COCOA, on which the generalization results in §5.3–5.4 depend.
Authors: We agree that a direct quantitative comparison of occlusion statistics would strengthen the justification for using TAOS-generated data to support generalization. In the revised manuscript, we will expand §4.2 with a new analysis subsection and accompanying figure that reports occlusion ratio histograms, boundary complexity metrics, number of overlapping instances, and topological features, comparing TAOS outputs directly against KINS and COCOA. This addition will explicitly validate the distributional match and bolster the generalization results in §5.3–5.4. revision: yes
- Referee (major comment 2, Table 1 and §5.1): SOTA numbers are reported without error bars, multiple random seeds, or ablations isolating TAOS from the adapter and the new objectives.
Authors: We concur that the lack of error bars, multi-seed statistics, and component-isolating ablations limits the ability to assess robustness, particularly with synthetic training data. In the revised version, we will update Table 1 and §5.1 to report mean and standard deviation over multiple random seeds (at least three runs) and insert new ablation tables that separately quantify the contributions of TAOS, the Spatial Completion Adapter, and the learning objectives. These controls will clarify that the SOTA gains are not artifacts of specific hyperparameter settings. revision: yes
- Referee (major comment 3, §3.3 Learning Objectives): the concrete effect of the topological regularization term on failure modes is not analyzed; an ablation on the novel-scenario test sets is required.
Authors: We recognize that demonstrating the specific impact of the topological regularization term on generalization to novel scenarios would better substantiate its role beyond synthetic-data regularization. We will add a targeted ablation study (in §5.3 or a new subsection) that removes this term and evaluates performance changes on the novel-scenario test sets, with particular attention to failure modes involving thin structures and multiply occluded objects. This will directly address the concern and clarify the term's contribution to the overall generalization claims. revision: yes
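The multi-seed reporting promised in the second response is simple to implement with the standard library; the scores below are placeholders for illustration, not results from the paper:

```python
# Mean ± sample standard deviation over random seeds, as promised for
# the revised Table 1. The scores are placeholders, not paper results.
from statistics import mean, stdev

seed_scores = {0: 71.2, 1: 70.8, 2: 71.6}  # e.g. mIoU per seed

m = mean(seed_scores.values())
s = stdev(seed_scores.values())
print(f"{m:.1f} ± {s:.1f}")  # prints "71.2 ± 0.4"
```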
Circularity Check
No circularity: empirical framework with independent benchmark validation
full rationale
The paper introduces Amodal SAM as a composite system (frozen SAM + lightweight adapter + TAOS synthesis + new objectives) whose performance claims rest on external benchmark results and held-out novel-scenario tests rather than any self-referential equation or fitted parameter renamed as a prediction. No mathematical derivations appear that equate outputs to inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
free parameters (2)
- Spatial Completion Adapter weights
- TAOS generation hyperparameters
axioms (1)
- domain assumption: SAM's pre-trained image encoder features contain sufficient information for occluded-region reconstruction when augmented by a small adapter.
invented entities (2)
- Spatial Completion Adapter: no independent evidence
- Target-Aware Occlusion Synthesis (TAOS) pipeline: no independent evidence