SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

Guangwei Gao; Guoan Xu; Guo-jun Qi; Jiaming Chen; Wenfeng Huang; Wenjing Jia

arxiv: 2411.17061 · v2 · submitted 2024-11-26 · 💻 cs.CV

SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

Guoan Xu , Jiaming Chen , Wenfeng Huang , Wenjing Jia , Guangwei Gao , Guo-jun Qi This is my paper

Pith reviewed 2026-05-23 17:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords semantic segmentationstrip cross-attentiondecoder headvision transformerefficient inferencecross-layer blockmulti-scale features

0 comments

The pith

SCASeg replaces skip connections with lateral strip cross-attention using encoder features as queries to achieve competitive segmentation accuracy at higher efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SCASeg as a decoder head tailored for semantic segmentation with Vision Transformer encoders. It replaces conventional skip connections with lateral connections that treat encoder features as queries in cross-attention modules. A Cross-Layer Block combines hierarchical maps from multiple encoder and decoder stages into keys and values while adding convolution to capture local context. Channel compression reduces queries and keys to one dimension, forming strip patterns that cut memory use and raise inference speed. Experiments across ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012 show the decoder matches or exceeds leading architectures under varied computational limits.

Core claim

SCASeg establishes that a decoder using encoder features directly as queries in cross-attention, integrated via a Cross-Layer Block that unifies multi-stage features with convolutional local perception, and compressed into strip attention patterns, delivers competitive semantic segmentation performance with greater efficiency than standard decoder designs.

What carries the argument

Strip Cross-Attention with the Cross-Layer Block (CLB), where encoder features serve as queries and compressed hierarchical maps supply keys and values in strip form.

If this is right

SCASeg adapts to multiple encoder backbones while preserving efficiency gains.
The strip compression lowers memory footprint and raises inference speed relative to vanilla cross-attention.
The decoder maintains competitive accuracy on ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012 across different compute budgets.
CLB integration enables capture of both global dependencies and local context across scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The channel-compression trick for creating strip attention could be tested in other dense-prediction heads to reduce compute.
Lateral query design might transfer to related tasks such as instance segmentation or object detection.
Deployment trials on edge hardware would reveal whether the reported speedups translate to real-time settings.

Load-bearing premise

The design assumes encoder features as queries plus CLB integration and channel compression will reliably boost multi-scale interaction and efficiency without hidden costs to generalization or stability on untested backbones or domains.

What would settle it

A controlled test in which SCASeg underperforms a standard decoder on a previously unused backbone or dataset would show the performance gains do not hold generally.

Figures

Figures reproduced from arXiv: 2411.17061 by Guangwei Gao, Guoan Xu, Guo-jun Qi, Jiaming Chen, Wenfeng Huang, Wenjing Jia.

**Figure 2.** Figure 2: The overall architecture of our proposed SCASeg and Cross-Layer Block (CLB). [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The proposed Strip Cross-Attention in comparison with the vanilla Self-Attention and Cross-Attention. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The architecture of our Local Perception Module (LPM). [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Visualized feature maps obtained before applying the Cross-Layer Block and after. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: A visual comparison of segmentation results obtained [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Visual segmentation results obtained on the ADE20K [ [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: The Cross-Layer Block (CLB) in the proposed SCASeg compared to its counterparts in SOTA approaches. [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Visual segmentation results obtained on the Cityscapes [ [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

read the original abstract

The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants widely validated across various downstream tasks, including semantic segmentation. However, as general-purpose visual encoders, ViT backbones often do not fully address the specific requirements of task decoders, highlighting opportunities for designing decoders optimized for efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head specifically designed for semantic segmentation. Instead of relying on the conventional skip connections, we utilize lateral connections between encoder and decoder stages, leveraging encoder features as Queries in cross-attention modules. Additionally, we introduce a Cross-Layer Block (CLB) that integrates hierarchical feature maps from various encoder and decoder stages to form a unified representation for Keys and Values. The CLB also incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers, thus enhancing feature interaction at different scales and improving overall efficiency. To further optimize computational efficiency, SCASeg compresses the channels of queries and keys into one dimension, creating strip-like patterns that reduce memory usage and increase inference speed compared to traditional vanilla cross-attention. Experiments show that SCASeg's adaptable decoder delivers competitive performance across various setups, outperforming leading segmentation architectures on benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under diverse computational constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SCASeg adds a concrete decoder recipe with 1D strip cross-attention and CLB fusion but the benchmark wins lack the ablations needed to trust the efficiency claims.

read the letter

The paper's main contribution is a segmentation decoder that swaps standard skip connections for lateral encoder-to-decoder links, treats encoder features as queries in cross-attention, fuses stages via a Cross-Layer Block that mixes convolution with hierarchical keys and values, and compresses query/key channels to one dimension to produce strip attention patterns. This is a specific, workable combination aimed at cutting memory and speed costs while retaining multi-scale context. The reported results on ADE20K, Cityscapes, COCO-Stuff, and Pascal VOC claim competitive or better numbers across compute budgets, which is the practical hook for real-time use cases. The design choices are straightforward engineering moves that target known pain points in ViT decoders. The 1D compression plus CLB path is the part that could actually deliver the efficiency gain if it works. The soft spot is that the abstract gives no ablations, no training protocol details, no error bars, and no direct comparison of the compression step against full-channel attention. The stress-test worry about lost representational capacity from the 1D reduction is reasonable on the evidence shown; nothing in the summary demonstrates that the CLB convolution path reliably restores the discarded channel interactions. Without those checks the outperformance could be brittle. This is useful reading for people already working on efficient segmentation heads who want another decoder option to test. It is incremental rather than foundational, but the concrete architecture and benchmark numbers are enough to justify sending it to referees who can examine the full experiments and code.

Referee Report

2 major / 2 minor

Summary. The paper introduces SCASeg, a decoder head for semantic segmentation on ViT backbones. It replaces conventional skip connections with lateral connections that use encoder features as Queries in cross-attention, introduces a Cross-Layer Block (CLB) that fuses hierarchical encoder/decoder maps into Keys and Values while adding convolutional local context, and compresses query/key channels to a single dimension to produce strip-like attention patterns that reduce memory and increase speed. Experiments claim that this decoder outperforms leading segmentation architectures on ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012 under varied computational budgets.

Significance. If the performance claims are robust, the design offers a practical route to more efficient task-specific decoders that combine global cross-attention with local convolution and hierarchical fusion, potentially improving inference speed and memory footprint for ViT-based segmentation without sacrificing accuracy on standard benchmarks.

major comments (2)

[Method (strip cross-attention and CLB description)] The central efficiency claim rests on compressing query and key channels to one dimension while relying on the CLB convolution path to recover multi-scale interactions; no ablation isolates whether this reduction discards irrecoverable per-channel distinctions that affect representational capacity under distribution shift.
[Experiments section] The outperformance claims on ADE20K, Cityscapes, COCO-Stuff, and Pascal VOC require explicit reporting of baselines, training protocols, statistical significance, and error bars; the abstract provides none, and the manuscript must demonstrate that gains are not attributable to unstated hyper-parameter advantages or single-run variance.

minor comments (2)

[Method] Notation for the CLB integration of hierarchical maps should be formalized with explicit equations rather than prose description to allow reproducibility.
[Figures] Figure captions for attention visualizations should state the exact input resolution and backbone used so readers can interpret the strip patterns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions that will be incorporated.

read point-by-point responses

Referee: [Method (strip cross-attention and CLB description)] The central efficiency claim rests on compressing query and key channels to one dimension while relying on the CLB convolution path to recover multi-scale interactions; no ablation isolates whether this reduction discards irrecoverable per-channel distinctions that affect representational capacity under distribution shift.

Authors: We agree that an explicit ablation isolating the channel compression would strengthen the claims. In the revised manuscript we will add an ablation comparing the 1D strip cross-attention against a full-channel cross-attention variant (both with and without the CLB) on ADE20K and Cityscapes. This will quantify any loss in per-channel representational capacity and confirm that the convolutional path within the CLB recovers the necessary multi-scale interactions. revision: yes
Referee: [Experiments section] The outperformance claims on ADE20K, Cityscapes, COCO-Stuff, and Pascal VOC require explicit reporting of baselines, training protocols, statistical significance, and error bars; the abstract provides none, and the manuscript must demonstrate that gains are not attributable to unstated hyper-parameter advantages or single-run variance.

Authors: We will revise the experimental section to include a dedicated table of training hyperparameters and protocols, ensuring all baselines are reproduced under identical settings. We will also report mean and standard deviation over three random seeds for the main results and update the abstract with key quantitative metrics. These additions will demonstrate that the reported gains are robust and not attributable to single-run variance or undisclosed hyper-parameter choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on benchmarks

full rationale

The paper introduces SCASeg as a decoder design using strip cross-attention and CLB, then reports competitive results on ADE20K, Cityscapes, COCO-Stuff, and Pascal VOC via direct experiments. No derivation chain, first-principles predictions, or equations exist that reduce outputs to inputs by construction. Claims rest on benchmark comparisons rather than self-definitional fits, self-citation load-bearing, or renamed known results. This matches the default case of a non-circular empirical CV contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method introduces learned modules whose internal parameters are trained on data.

pith-pipeline@v0.9.0 · 5802 in / 1191 out tokens · 42360 ms · 2026-05-23T17:01:24.702260+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 2 internal anchors

[1]

Xcit: Cross-covariance image transformers

Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bo- janowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Na- talia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. Advances in Neural Information processing Systems, 34:20014–20027, 2021. 2, 5

work page 2021
[2]

Medical image segmentation review: The suc- cess of u-net

Reza Azad, Ehsan Khodapanah Aghdam, Amelie Rauland, Yiwei Jia, Atlas Haddadi Avval, Afshin Bozorgpour, Sanaz Karimijafarbigloo, Joseph Paul Cohen, Ehsan Adeli, and Dorit Merhof. Medical image segmentation review: The suc- cess of u-net. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10076–10095, 2024. 1

work page 2024
[3]

Coco- stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. In CVPR, pages 1209–1218, 2018. 6, 7

work page 2018
[4]

Sdpt: Semantic- aware dimension-pooling transformer for image segmenta- tion

Hu Cao, Guang Chen, Hengshuang Zhao, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Alois Knoll. Sdpt: Semantic- aware dimension-pooling transformer for image segmenta- tion. IEEE Transactions on Intelligent Transportation Sys- tems, 25(11):15934–15946, 2024. 6

work page 2024
[5]

Pem: Prototype-based efficient maskformer for image segmentation

Niccol `o Cavagnero, Gabriele Rosi, Claudia Cuttano, Francesca Pistilli, Marco Ciccone, Giuseppe Averta, and Fabio Cermelli. Pem: Prototype-based efficient maskformer for image segmentation. In CVPR, pages 15804–15813,

work page
[6]

Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for seman- tic image segmentation. arXiv preprint arXiv:1706.05587 ,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 2

work page 2018
[8]

Per- pixel classification is not all you need for semantic segmen- tation

Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021. 1, 3

work page 2021
[9]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022. 2

work page 2022
[10]

MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark

MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https : / / github . com / open - mmlab/mmsegmentation, 2020. 6

work page 2020
[11]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016. 6, 2, 3, 4

work page 2016
[12]

Boundary-aware feature propa- gation for scene segmentation

Henghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang Wang. Boundary-aware feature propa- gation for scene segmentation. In ICCV, pages 6819–6829,

work page
[13]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1

work page internal anchor Pith review Pith/arXiv arXiv 2010
[14]

Dual attention network for scene segmentation

Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, pages 3146–3154, 2019. 1, 2

work page 2019
[15]

Cmt: Convolutional neural networks meet vision transformers

Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. Cmt: Convolutional neural networks meet vision transformers. In CVPR, pages 12175–12185, 2022. 2, 5

work page 2022
[16]

Segnext: Rethink- ing convolutional attention design for semantic segmenta- tion

Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethink- ing convolutional attention design for semantic segmenta- tion. Advances in Neural Information Processing Systems , 35:1140–1156, 2022. 1, 2, 6, 7, 3

work page 2022
[17]

Adaptive pyramid context network for semantic seg- mentation

Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic seg- mentation. In CVPR, pages 7519–7528, 2019. 2

work page 2019
[18]

Pas- cal voc 2008 challenge

Derek Hoiem, Santosh K Divvala, and James H Hays. Pas- cal voc 2008 challenge. World Literature Today, 24(1):1–4,

work page 2008
[19]

Ccnet: Criss-cross attention for semantic segmentation

Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, pages 603– 612, 2019. 2

work page 2019
[20]

Metaseg: Metaformer-based global contexts-aware network for efficient semantic segmentation

Beoungwoo Kang, Seunghun Moon, Yubin Cho, Hyunwoo Yu, and Suk-Ju Kang. Metaseg: Metaformer-based global contexts-aware network for efficient semantic segmentation. In WACV, pages 434–443, 2024. 2, 5, 6, 7, 1, 3

work page 2024
[21]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In ICCV, pages 4015–4026, 2023. 1

work page 2023
[22]

Lisa: Reasoning segmenta- tion via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmenta- tion via large language model. In CVPR, pages 9579–9589,

work page
[23]

Semantic image segmenta- tion with deep convolutional nets and fully connected crfs

Chen Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille. Semantic image segmenta- tion with deep convolutional nets and fully connected crfs. In ICLR, 2015. 2

work page 2015
[24]

Scale-aware modulation meet transformer

Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, and Lian- wen Jin. Scale-aware modulation meet transformer. InICCV, pages 6015–6026, 2023. 2, 5

work page 2023
[25]

Auto- deeplab: Hierarchical neural architecture search for semantic image segmentation

Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto- deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, pages 82–92, 2019. 2

work page 2019
[26]

Bpkd: Boundary privileged knowledge distillation for semantic segmentation

Liyang Liu, Zihan Wang, Minh Hieu Phan, Bowen Zhang, Jinchao Ge, and Yifan Liu. Bpkd: Boundary privileged knowledge distillation for semantic segmentation. In WACV, pages 1062–1072, 2024. 2

work page 2024
[27]

Fully convolutional networks for semantic segmentation

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 1, 2 9

work page 2015
[28]

Efficient modulation for vision net- works

Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, and Lu Yuan. Efficient modulation for vision net- works. arXiv preprint arXiv:2403.19963, 2024. 1, 2

work page arXiv 2024
[29]

Large kernel matters–improve semantic segmenta- tion by global convolutional network

Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmenta- tion by global convolutional network. InCVPR, pages 4353– 4361, 2017. 2

work page 2017
[30]

A transformer-based decoder for semantic segmentation with multi-level context mining

Bowen Shi, Dongsheng Jiang, Xiaopeng Zhang, Han Li, Wenrui Dai, Junni Zou, Hongkai Xiong, and Qi Tian. A transformer-based decoder for semantic segmentation with multi-level context mining. In ECCV, pages 624–639. Springer, 2022. 1, 2

work page 2022
[31]

Feedformer: Revisiting transformer decoder for ef- ficient semantic segmentation

Jae-hun Shim, Hyunwoo Yu, Kyeongbo Kong, and Suk-Ju Kang. Feedformer: Revisiting transformer decoder for ef- ficient semantic segmentation. In AAAI, pages 2263–2271,

work page
[32]

Segmenter: Transformer for semantic segmenta- tion

Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmenta- tion. In ICCV, pages 7262–7272, 2021. 2, 3

work page 2021
[33]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30:1–11, 2017. 1

work page 2017
[34]

Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation

Qiang Wan, Zilong Huang, Jiachen Lu, YU Gang, and Li Zhang. Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In ICLR, 2023. 6, 2

work page 2023
[35]

Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model

Di Wang, Jing Zhang, Bo Du, Minqiang Xu, Lin Liu, Dacheng Tao, and Liangpei Zhang. Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model. Advances in Neural Information Processing Systems, 36, 2024. 1

work page 2024
[36]

Non-local neural networks

Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim- ing He. Non-local neural networks. In CVPR, pages 7794– 7803, 2018. 2

work page 2018
[37]

Vit-comer: Vision transformer with convolu- tional multi-scale feature interaction for dense predictions

Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, and Yifeng Shi. Vit-comer: Vision transformer with convolu- tional multi-scale feature interaction for dense predictions. In CVPR, pages 5493–5502, 2024. 2, 3

work page 2024
[38]

Segformer: Simple and efficient design for semantic segmentation with transform- ers

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transform- ers. Advances in Neural Information Processing Systems , 34:12077–12090, 2021. 2, 6, 7, 3

work page 2021
[39]

Lightweight real-time semantic seg- mentation network with efficient transformer and cnn

Guoan Xu, Juncheng Li, Guangwei Gao, Huimin Lu, Jian Yang, and Dong Yue. Lightweight real-time semantic seg- mentation network with efficient transformer and cnn. IEEE Transactions on Intelligent Transportation Systems, 24(12): 15897–15906, 2023. 1

work page 2023
[40]

Mac- former: Semantic segmentation with fine object boundaries

Guoan Xu, Wenfeng Huang, Tao Wu, Ligeng Chen, Wenjing Jia, Guangwei Gao, Xiatian Zhu, and Stuart Perry. Mac- former: Semantic segmentation with fine object boundaries. arXiv preprint arXiv:2408.05699, 2024. 2, 1

work page arXiv 2024
[41]

Sctnet: Single-branch cnn with transformer semantic information for real-time segmen- tation

Zhengze Xu, Dongyue Wu, Changqian Yu, Xiangxiang Chu, Nong Sang, and Changxin Gao. Sctnet: Single-branch cnn with transformer semantic information for real-time segmen- tation. In AAAI, pages 6378–6386, 2024. 2, 6

work page 2024
[42]

Multi-scale rep- resentations by varing window attention for semantic seg- mentation

Haotian Yan, Ming Wu, and Chuang Zhang. Multi-scale rep- resentations by varing window attention for semantic seg- mentation. In ICLR, 2024. 6

work page 2024
[43]

Multi-scale rep- resentations by varying window attention for semantic seg- mentation

Haotian Yan, Ming Wu, and Chuang Zhang. Multi-scale rep- resentations by varying window attention for semantic seg- mentation. In ICLR, 2024. 6, 7, 2, 3

work page 2024
[44]

U-mixformer: Unet- like transformer with mix-attention for efficient semantic segmentation

Seul-Ki Yeom and Julian von Klitzing. U-mixformer: Unet- like transformer with mix-attention for efficient semantic segmentation. arXiv preprint arXiv:2312.06272 , 2023. 2, 3, 6, 7, 1

work page arXiv 2023
[45]

Context prior for scene seg- mentation

Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for scene seg- mentation. In CVPR, pages 12416–12425, 2020. 2

work page 2020
[46]

Metaformer is actually what you need for vision

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. InCVPR, pages 10819– 10829, 2022. 2, 4

work page 2022
[47]

Object- contextual representations for semantic segmentation

Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object- contextual representations for semantic segmentation. In ECCV, pages 173–190. Springer, 2020. 1, 2

work page 2020
[48]

Segfix: Model-agnostic boundary refinement for segmenta- tion

Yuhui Yuan, Jingyi Xie, Xilin Chen, and Jingdong Wang. Segfix: Model-agnostic boundary refinement for segmenta- tion. In ECCV, pages 489–506. Springer, 2020. 2

work page 2020
[49]

Con- text encoding for semantic segmentation

Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Con- text encoding for semantic segmentation. In CVPR, pages 7151–7160, 2018. 1, 2

work page 2018
[50]

Joint se- mantic segmentation and boundary detection using iterative pyramid contexts

Mingmin Zhen, Jinglu Wang, Lei Zhou, Shiwei Li, Tianwei Shen, Jiaxiang Shang, Tian Fang, and Long Quan. Joint se- mantic segmentation and boundary detection using iterative pyramid contexts. In CVPR, pages 13666–13675, 2020. 2

work page 2020
[51]

Rethinking semantic segmen- tation from a sequence-to-sequence perspective with trans- formers

Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmen- tation from a sequence-to-sequence perspective with trans- formers. In CVPR, pages 6881–6890, 2021. 2, 3

work page 2021
[52]

Squeeze-and-attention networks for semantic segmentation

Zilong Zhong, Zhong Qiu Lin, Rene Bidart, Xiaodan Hu, Ibrahim Ben Daya, Zhifeng Li, Wei-Shi Zheng, Jonathan Li, and Alexander Wong. Squeeze-and-attention networks for semantic segmentation. In CVPR, pages 13065–13074,

work page
[53]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 633–641, 2017. 1, 6, 8, 2, 3 10 SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation Supplementary Material This supplementary document provides additional in- sights and experimental results to compleme...

work page 2017
[54]

In this Supplementary, we il- lustrate the relationship of the proposed Cross-Layer Block (CLB) to other SOTA attention blocks, as shown in Fig

Relationship of CLB with Existing Attention Blocks In the main paper, we introduced a Cross-Layer Block (CLB) that blends hierarchical feature maps from different encoder and decoder stages to create a unified representa- tion for Keys and Values. In this Supplementary, we il- lustrate the relationship of the proposed Cross-Layer Block (CLB) to other SOTA...

work page
[55]

In this Sup- plementary, we present additional experimental comparison conducted with medium-weight and heavy-weight models on ADE20K and Cityscapes

Additional Experimental Comparisons with Medium-weight and Heavy-weight Models In the main paper, we compared the performance of the SCASeg (MiT-B0) with lightweight models. In this Sup- plementary, we present additional experimental comparison conducted with medium-weight and heavy-weight models on ADE20K and Cityscapes. 2.1. Medium-weight Models: As pre...

work page
[56]

9 shows additional visual comparison of the segmen- tation results obtained on the Cityscapes datasets using our SCASeg and SOTA methods

Additional Visualization Results Fig. 9 shows additional visual comparison of the segmen- tation results obtained on the Cityscapes datasets using our SCASeg and SOTA methods

work page
[57]

Additional Ablation Studies Effectiveness of the Local Perception Module (LPM): Table 3 in the main paper also presents the results of com- bining SCA with LPM, forming the complete CLB struc- ture. With the addition of LPM, the parameter count and computational load become comparable to those of CA, while this combination achieves an increase in segmenta...

work page 2019

[1] [1]

Xcit: Cross-covariance image transformers

Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bo- janowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Na- talia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. Advances in Neural Information processing Systems, 34:20014–20027, 2021. 2, 5

work page 2021

[2] [2]

Medical image segmentation review: The suc- cess of u-net

Reza Azad, Ehsan Khodapanah Aghdam, Amelie Rauland, Yiwei Jia, Atlas Haddadi Avval, Afshin Bozorgpour, Sanaz Karimijafarbigloo, Joseph Paul Cohen, Ehsan Adeli, and Dorit Merhof. Medical image segmentation review: The suc- cess of u-net. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10076–10095, 2024. 1

work page 2024

[3] [3]

Coco- stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. In CVPR, pages 1209–1218, 2018. 6, 7

work page 2018

[4] [4]

Sdpt: Semantic- aware dimension-pooling transformer for image segmenta- tion

Hu Cao, Guang Chen, Hengshuang Zhao, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Alois Knoll. Sdpt: Semantic- aware dimension-pooling transformer for image segmenta- tion. IEEE Transactions on Intelligent Transportation Sys- tems, 25(11):15934–15946, 2024. 6

work page 2024

[5] [5]

Pem: Prototype-based efficient maskformer for image segmentation

Niccol `o Cavagnero, Gabriele Rosi, Claudia Cuttano, Francesca Pistilli, Marco Ciccone, Giuseppe Averta, and Fabio Cermelli. Pem: Prototype-based efficient maskformer for image segmentation. In CVPR, pages 15804–15813,

work page

[6] [6]

Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for seman- tic image segmentation. arXiv preprint arXiv:1706.05587 ,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 2

work page 2018

[8] [8]

Per- pixel classification is not all you need for semantic segmen- tation

Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021. 1, 3

work page 2021

[9] [9]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022. 2

work page 2022

[10] [10]

MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark

MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https : / / github . com / open - mmlab/mmsegmentation, 2020. 6

work page 2020

[11] [11]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016. 6, 2, 3, 4

work page 2016

[12] [12]

Boundary-aware feature propa- gation for scene segmentation

Henghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang Wang. Boundary-aware feature propa- gation for scene segmentation. In ICCV, pages 6819–6829,

work page

[13] [13]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1

work page internal anchor Pith review Pith/arXiv arXiv 2010

[14] [14]

Dual attention network for scene segmentation

Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, pages 3146–3154, 2019. 1, 2

work page 2019

[15] [15]

Cmt: Convolutional neural networks meet vision transformers

Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. Cmt: Convolutional neural networks meet vision transformers. In CVPR, pages 12175–12185, 2022. 2, 5

work page 2022

[16] [16]

Segnext: Rethink- ing convolutional attention design for semantic segmenta- tion

Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethink- ing convolutional attention design for semantic segmenta- tion. Advances in Neural Information Processing Systems , 35:1140–1156, 2022. 1, 2, 6, 7, 3

work page 2022

[17] [17]

Adaptive pyramid context network for semantic seg- mentation

Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic seg- mentation. In CVPR, pages 7519–7528, 2019. 2

work page 2019

[18] [18]

Pas- cal voc 2008 challenge

Derek Hoiem, Santosh K Divvala, and James H Hays. Pas- cal voc 2008 challenge. World Literature Today, 24(1):1–4,

work page 2008

[19] [19]

Ccnet: Criss-cross attention for semantic segmentation

Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, pages 603– 612, 2019. 2

work page 2019

[20] [20]

Metaseg: Metaformer-based global contexts-aware network for efficient semantic segmentation

Beoungwoo Kang, Seunghun Moon, Yubin Cho, Hyunwoo Yu, and Suk-Ju Kang. Metaseg: Metaformer-based global contexts-aware network for efficient semantic segmentation. In WACV, pages 434–443, 2024. 2, 5, 6, 7, 1, 3

work page 2024

[21] [21]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In ICCV, pages 4015–4026, 2023. 1

work page 2023

[22] [22]

Lisa: Reasoning segmenta- tion via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmenta- tion via large language model. In CVPR, pages 9579–9589,

work page

[23] [23]

Semantic image segmenta- tion with deep convolutional nets and fully connected crfs

Chen Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille. Semantic image segmenta- tion with deep convolutional nets and fully connected crfs. In ICLR, 2015. 2

work page 2015

[24] [24]

Scale-aware modulation meet transformer

Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, and Lian- wen Jin. Scale-aware modulation meet transformer. InICCV, pages 6015–6026, 2023. 2, 5

work page 2023

[25] [25]

Auto- deeplab: Hierarchical neural architecture search for semantic image segmentation

Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto- deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, pages 82–92, 2019. 2

work page 2019

[26] [26]

Bpkd: Boundary privileged knowledge distillation for semantic segmentation

Liyang Liu, Zihan Wang, Minh Hieu Phan, Bowen Zhang, Jinchao Ge, and Yifan Liu. Bpkd: Boundary privileged knowledge distillation for semantic segmentation. In WACV, pages 1062–1072, 2024. 2

work page 2024

[27] [27]

Fully convolutional networks for semantic segmentation

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 1, 2 9

work page 2015

[28] [28]

Efficient modulation for vision net- works

Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, and Lu Yuan. Efficient modulation for vision net- works. arXiv preprint arXiv:2403.19963, 2024. 1, 2

work page arXiv 2024

[29] [29]

Large kernel matters–improve semantic segmenta- tion by global convolutional network

Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmenta- tion by global convolutional network. InCVPR, pages 4353– 4361, 2017. 2

work page 2017

[30] [30]

A transformer-based decoder for semantic segmentation with multi-level context mining

Bowen Shi, Dongsheng Jiang, Xiaopeng Zhang, Han Li, Wenrui Dai, Junni Zou, Hongkai Xiong, and Qi Tian. A transformer-based decoder for semantic segmentation with multi-level context mining. In ECCV, pages 624–639. Springer, 2022. 1, 2

work page 2022

[31] [31]

Feedformer: Revisiting transformer decoder for ef- ficient semantic segmentation

Jae-hun Shim, Hyunwoo Yu, Kyeongbo Kong, and Suk-Ju Kang. Feedformer: Revisiting transformer decoder for ef- ficient semantic segmentation. In AAAI, pages 2263–2271,

work page

[32] [32]

Segmenter: Transformer for semantic segmenta- tion

Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmenta- tion. In ICCV, pages 7262–7272, 2021. 2, 3

work page 2021

[33] [33]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30:1–11, 2017. 1

work page 2017

[34] [34]

Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation

Qiang Wan, Zilong Huang, Jiachen Lu, YU Gang, and Li Zhang. Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In ICLR, 2023. 6, 2

work page 2023

[35] [35]

Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model

Di Wang, Jing Zhang, Bo Du, Minqiang Xu, Lin Liu, Dacheng Tao, and Liangpei Zhang. Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model. Advances in Neural Information Processing Systems, 36, 2024. 1

work page 2024

[36] [36]

Non-local neural networks

Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim- ing He. Non-local neural networks. In CVPR, pages 7794– 7803, 2018. 2

work page 2018

[37] [37]

Vit-comer: Vision transformer with convolu- tional multi-scale feature interaction for dense predictions

Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, and Yifeng Shi. Vit-comer: Vision transformer with convolu- tional multi-scale feature interaction for dense predictions. In CVPR, pages 5493–5502, 2024. 2, 3

work page 2024

[38] [38]

Segformer: Simple and efficient design for semantic segmentation with transform- ers

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transform- ers. Advances in Neural Information Processing Systems , 34:12077–12090, 2021. 2, 6, 7, 3

work page 2021

[39] [39]

Lightweight real-time semantic seg- mentation network with efficient transformer and cnn

Guoan Xu, Juncheng Li, Guangwei Gao, Huimin Lu, Jian Yang, and Dong Yue. Lightweight real-time semantic seg- mentation network with efficient transformer and cnn. IEEE Transactions on Intelligent Transportation Systems, 24(12): 15897–15906, 2023. 1

work page 2023

[40] [40]

Mac- former: Semantic segmentation with fine object boundaries

Guoan Xu, Wenfeng Huang, Tao Wu, Ligeng Chen, Wenjing Jia, Guangwei Gao, Xiatian Zhu, and Stuart Perry. Mac- former: Semantic segmentation with fine object boundaries. arXiv preprint arXiv:2408.05699, 2024. 2, 1

work page arXiv 2024

[41] [41]

Sctnet: Single-branch cnn with transformer semantic information for real-time segmen- tation

Zhengze Xu, Dongyue Wu, Changqian Yu, Xiangxiang Chu, Nong Sang, and Changxin Gao. Sctnet: Single-branch cnn with transformer semantic information for real-time segmen- tation. In AAAI, pages 6378–6386, 2024. 2, 6

work page 2024

[42] [42]

Multi-scale rep- resentations by varing window attention for semantic seg- mentation

Haotian Yan, Ming Wu, and Chuang Zhang. Multi-scale rep- resentations by varing window attention for semantic seg- mentation. In ICLR, 2024. 6

work page 2024

[43] [43]

Multi-scale rep- resentations by varying window attention for semantic seg- mentation

Haotian Yan, Ming Wu, and Chuang Zhang. Multi-scale rep- resentations by varying window attention for semantic seg- mentation. In ICLR, 2024. 6, 7, 2, 3

work page 2024

[44] [44]

U-mixformer: Unet- like transformer with mix-attention for efficient semantic segmentation

Seul-Ki Yeom and Julian von Klitzing. U-mixformer: Unet- like transformer with mix-attention for efficient semantic segmentation. arXiv preprint arXiv:2312.06272 , 2023. 2, 3, 6, 7, 1

work page arXiv 2023

[45] [45]

Context prior for scene seg- mentation

Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for scene seg- mentation. In CVPR, pages 12416–12425, 2020. 2

work page 2020

[46] [46]

Metaformer is actually what you need for vision

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. InCVPR, pages 10819– 10829, 2022. 2, 4

work page 2022

[47] [47]

Object- contextual representations for semantic segmentation

Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object- contextual representations for semantic segmentation. In ECCV, pages 173–190. Springer, 2020. 1, 2

work page 2020

[48] [48]

Segfix: Model-agnostic boundary refinement for segmenta- tion

Yuhui Yuan, Jingyi Xie, Xilin Chen, and Jingdong Wang. Segfix: Model-agnostic boundary refinement for segmenta- tion. In ECCV, pages 489–506. Springer, 2020. 2

work page 2020

[49] [49]

Con- text encoding for semantic segmentation

Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Con- text encoding for semantic segmentation. In CVPR, pages 7151–7160, 2018. 1, 2

work page 2018

[50] [50]

Joint se- mantic segmentation and boundary detection using iterative pyramid contexts

Mingmin Zhen, Jinglu Wang, Lei Zhou, Shiwei Li, Tianwei Shen, Jiaxiang Shang, Tian Fang, and Long Quan. Joint se- mantic segmentation and boundary detection using iterative pyramid contexts. In CVPR, pages 13666–13675, 2020. 2

work page 2020

[51] [51]

Rethinking semantic segmen- tation from a sequence-to-sequence perspective with trans- formers

Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmen- tation from a sequence-to-sequence perspective with trans- formers. In CVPR, pages 6881–6890, 2021. 2, 3

work page 2021

[52] [52]

Squeeze-and-attention networks for semantic segmentation

Zilong Zhong, Zhong Qiu Lin, Rene Bidart, Xiaodan Hu, Ibrahim Ben Daya, Zhifeng Li, Wei-Shi Zheng, Jonathan Li, and Alexander Wong. Squeeze-and-attention networks for semantic segmentation. In CVPR, pages 13065–13074,

work page

[53] [53]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 633–641, 2017. 1, 6, 8, 2, 3 10 SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation Supplementary Material This supplementary document provides additional in- sights and experimental results to compleme...

work page 2017

[54] [54]

In this Supplementary, we il- lustrate the relationship of the proposed Cross-Layer Block (CLB) to other SOTA attention blocks, as shown in Fig

Relationship of CLB with Existing Attention Blocks In the main paper, we introduced a Cross-Layer Block (CLB) that blends hierarchical feature maps from different encoder and decoder stages to create a unified representa- tion for Keys and Values. In this Supplementary, we il- lustrate the relationship of the proposed Cross-Layer Block (CLB) to other SOTA...

work page

[55] [55]

In this Sup- plementary, we present additional experimental comparison conducted with medium-weight and heavy-weight models on ADE20K and Cityscapes

Additional Experimental Comparisons with Medium-weight and Heavy-weight Models In the main paper, we compared the performance of the SCASeg (MiT-B0) with lightweight models. In this Sup- plementary, we present additional experimental comparison conducted with medium-weight and heavy-weight models on ADE20K and Cityscapes. 2.1. Medium-weight Models: As pre...

work page

[56] [56]

9 shows additional visual comparison of the segmen- tation results obtained on the Cityscapes datasets using our SCASeg and SOTA methods

Additional Visualization Results Fig. 9 shows additional visual comparison of the segmen- tation results obtained on the Cityscapes datasets using our SCASeg and SOTA methods

work page

[57] [57]

Additional Ablation Studies Effectiveness of the Local Perception Module (LPM): Table 3 in the main paper also presents the results of com- bining SCA with LPM, forming the complete CLB struc- ture. With the addition of LPM, the parameter count and computational load become comparable to those of CA, while this combination achieves an increase in segmenta...

work page 2019