Cross Attention Network for Semantic Segmentation

Hujun Yin; Mengyu Liu

arxiv: 1907.10958 · v1 · pith:VZQZYQ4Rnew · submitted 2019-07-25 · 💻 cs.CV

Cross Attention Network for Semantic Segmentation

Mengyu Liu , Hujun Yin This is my paper

Pith reviewed 2026-05-24 16:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords semantic segmentationcross attentionfeature fusionreal-time segmentationCityscapesCamVidattention mechanismslightweight networks

0 comments

The pith

A cross-attention module that pulls spatial attention from a shallow branch and channel attention from a deep branch improves semantic segmentation accuracy and speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a Cross Attention Network with two branches and a Feature Cross Attention module to handle semantic segmentation. One shallow branch preserves low-level spatial details while the other deep branch extracts high-level contextual information. The module computes a spatial attention map from one branch and a channel attention map from the other, then fuses them so that context guides the overall features and spatial cues sharpen local boundaries. Experiments show this design beats competing real-time approaches on Cityscapes and CamVid when paired with lightweight backbones and reaches state-of-the-art numbers with deeper backbones.

Core claim

The central claim is that the Feature Cross Attention module produces superior fused representations by separately deriving a spatial attention map from the shallow branch and a channel attention map from the deep branch before combining them, allowing contextual features to supply global guidance while spatial features refine localizations.

What carries the argument

The Feature Cross Attention (FCA) module, which derives a spatial attention map from one branch and a channel attention map from the other before fusing the two.

If this is right

The network outperforms other real-time methods with improved speed on Cityscapes and CamVid when using lightweight backbones.
It reaches state-of-the-art performance on the same datasets when a deep backbone is used.
Contextual features from the deep branch supply global guidance to the fused maps while spatial features from the shallow branch refine localizations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cross-branch attention pattern could be tested on other dense prediction tasks such as depth estimation to check whether the same fusion benefit appears.
Because the module keeps branches separate until the final fusion step, it may scale to higher-resolution inputs with less memory growth than joint attention designs.
The reported speed gains suggest the architecture could support real-time video segmentation without additional hardware-specific optimizations.

Load-bearing premise

That separately deriving and then fusing a spatial attention map from one branch with a channel attention map from the other yields a better combined representation than ordinary feature concatenation or existing attention blocks.

What would settle it

An ablation study on Cityscapes in which the FCA module is replaced by simple concatenation or a standard attention block and segmentation accuracy stays the same or improves.

read the original abstract

In this paper, we address the semantic segmentation task with a deep network that combines contextual features and spatial information. The proposed Cross Attention Network is composed of two branches and a Feature Cross Attention (FCA) module. Specifically, a shallow branch is used to preserve low-level spatial information and a deep branch is employed to extract high-level contextual features. Then the FCA module is introduced to combine these two branches. Different from most existing attention mechanisms, the FCA module obtains spatial attention map and channel attention map from two branches separately, and then fuses them. The contextual features are used to provide global contextual guidance in fused feature maps, and spatial features are used to refine localizations. The proposed network outperforms other real-time methods with improved speed on the Cityscapes and CamVid datasets with lightweight backbones, and achieves state-of-the-art performance with a deep backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main contribution is a dual-branch setup with a Feature Cross Attention module that pulls spatial attention from the shallow branch and channel attention from the deep branch before fusing, but the abstract gives no numbers or ablations to show whether this beats standard concatenation.

read the letter

The paper's core proposal is a Cross Attention Network with two branches and a Feature Cross Attention module. The shallow branch keeps low-level spatial details while the deep branch extracts high-level context. The FCA module then takes a spatial attention map from one branch and a channel attention map from the other, fuses them, and uses the result to guide the features. This cross-branch sourcing of the attention maps is the specific new design element compared with prior attention blocks that typically drew both maps from the same features or used simpler fusion.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Cross Attention Network (CAN) for semantic segmentation. It consists of a shallow branch to preserve low-level spatial information, a deep branch to extract high-level contextual features, and a Feature Cross Attention (FCA) module that derives a spatial attention map from one branch and a channel attention map from the other before fusing them. The contextual features provide global guidance while spatial features refine localizations. The authors claim that CAN outperforms other real-time methods with improved speed on Cityscapes and CamVid using lightweight backbones and achieves state-of-the-art results with a deep backbone.

Significance. If the reported gains hold under rigorous controls and the FCA module's specific cross-branch design is shown to be responsible for the improvements rather than the dual-branch architecture alone, the work would offer a targeted contribution to efficient attention-based fusion in semantic segmentation. The approach extends existing attention mechanisms in a structured way that could influence real-time model design on standard benchmarks.

major comments (2)

[Experiments section (Tables reporting Cityscapes/CamVid results)] The central empirical claim rests on the FCA module outperforming standard fusion, yet the experiments provide no ablation that directly compares the proposed cross-attention fusion (spatial map from shallow branch + channel map from deep branch) against direct concatenation of the two branches or against established attention blocks such as SENet or CBAM. Without this isolation, the headline attribution to FCA cannot be evaluated.
[Section 3 (FCA module description) and corresponding results tables] The abstract and method description assert that the fused maps supply 'global contextual guidance' and 'refine localizations,' but no quantitative breakdown (e.g., per-class IoU deltas or feature-map visualizations) demonstrates that these benefits arise specifically from the cross-branch attention derivation rather than from the dual-branch backbone itself.

minor comments (1)

[Abstract] The abstract supplies no numerical results, metrics, or baseline comparisons, which is atypical for an empirical computer-vision paper and forces the reader to reach the experiments section before any assessment of the claims is possible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on isolating the contribution of the FCA module. We address the two major comments below and will update the manuscript with additional experiments and analyses.

read point-by-point responses

Referee: [Experiments section (Tables reporting Cityscapes/CamVid results)] The central empirical claim rests on the FCA module outperforming standard fusion, yet the experiments provide no ablation that directly compares the proposed cross-attention fusion (spatial map from shallow branch + channel map from deep branch) against direct concatenation of the two branches or against established attention blocks such as SENet or CBAM. Without this isolation, the headline attribution to FCA cannot be evaluated.

Authors: We agree that the current experiments do not include a direct ablation isolating the cross-attention fusion against concatenation or blocks such as SENet and CBAM. While the manuscript reports gains over competing real-time methods on Cityscapes and CamVid, these do not fully separate the FCA design from the dual-branch architecture. We will add the requested ablation studies in the revised manuscript. revision: yes
Referee: [Section 3 (FCA module description) and corresponding results tables] The abstract and method description assert that the fused maps supply 'global contextual guidance' and 'refine localizations,' but no quantitative breakdown (e.g., per-class IoU deltas or feature-map visualizations) demonstrates that these benefits arise specifically from the cross-branch attention derivation rather than from the dual-branch backbone itself.

Authors: We acknowledge that the manuscript lacks per-class IoU breakdowns and feature-map visualizations that would more directly attribute the claimed guidance and localization benefits to the cross-branch attention rather than the dual-branch structure alone. We will incorporate these quantitative and visual analyses in the revision to strengthen the evidence. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on public dataset benchmarks

full rationale

The provided abstract and description contain no equations, fitted parameters, or derivations. The FCA module is described architecturally (separate spatial/channel attention maps fused across branches) without any self-referential definitions or predictions that reduce to inputs. Performance claims are external evaluations on Cityscapes and CamVid; no self-citation chains or ansatzes are invoked in the given text. This is a standard non-circular empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5664 in / 1014 out tokens · 25304 ms · 2026-05-24T16:10:14.339250+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The FCA module obtains spatial attention map and channel attention map from two branches separately, and then fuses them. The contextual features are used to provide global contextual guidance in fused feature maps, and spatial features are used to refine localizations.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

a shallow branch is used to preserve low-level spatial information and a deep branch is employed to extract high-level contextual features

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

[1]

It has wide applications in satellite imagery analysis, medical im- age diagnostics, indoor scene understanding, etc

INTRODUCTION Semantic segmentation, which assigns class labels to image pixels, is a fundamental problem in computer vision. It has wide applications in satellite imagery analysis, medical im- age diagnostics, indoor scene understanding, etc. Signiﬁcant progress has been made recently with the Fully Convolutional Network (FCN) [1], replacing the fully con...

work page
[2]

and CamVid [8], with fewer parameters and faster speed compared to the existing methods

work page
[3]

Long et al

RELA TED WORK Semantic segmentation. Long et al . proposed the FCN

work page
[4]

Cross Attention Network for Semantic Segmentation

to take arbitrary sized input and produce corresponding segmentation map. Skipping connections were introduced to combine coarse and ﬁne predictions to obtain denser feature maps. Chen et al. proposed a series of segmentation networks called DeepLab [3, 9] and employed atrous convolutions to increase the ﬁeld of view and maintain the spatial resolution wi...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[5]

In [3], an Atrous Spatial Pyramid Pooling module was embedded at the end of the network to capture multi-scale information

performs multi-scale spatial pooling at the ﬁnal feature maps to capture global features. In [3], an Atrous Spatial Pyramid Pooling module was embedded at the end of the network to capture multi-scale information. Recent research showed that preserving spatial details helped achieve good re- sults [12, 6]. DeepLab [3], PSPNet and DUC [13] employ atrous co...

work page
[6]

Two branches Based on the existing methods, the two-branch architecture can encode spatial information and extract deep contextual features

METHODS 3.1. Two branches Based on the existing methods, the two-branch architecture can encode spatial information and extract deep contextual features. In this architecture, a shallow branch is designed for preserving spatial information and a deep network is em- ployed for capturing context. In the proposed CANet, the spatial branch only consists of th...

work page
[7]

Mo- bileNetV2 builds upon the idea of depthwise separable con- volutions and can be used as a powerful feature extractor with high efﬁciency

is employed as the backbone of the context path. Mo- bileNetV2 builds upon the idea of depthwise separable con- volutions and can be used as a powerful feature extractor with high efﬁciency. In the context branch, the ﬁnal convolutional layer of MobileNetV2 is discarded. The features of the last two stages are upsampled by deconvolutions and concate- nate...

work page
[8]

We ﬁrst introduce the implementation protocol Fig

EXPERIMENTAL RESULTS We evaluated our CANets on two benchmark datasets: the CamVid road scenes dataset and the urban scene dataset Cityscapes. We ﬁrst introduce the implementation protocol Fig. 2: Architecture of the Cross Attention Network. and conduct ablation studies on the Cityscapes validation dataset, and ﬁnally we report the results on Cityscapes a...

work page 2018
[9]

The results are shown in Table 2

with CUDA 10.0, and each network was randomly ini- tialized and evaluated for 100 times. The results are shown in Table 2. Next, we trained and evaluated CANet1 and CANet2 at 1024 ×512 and accuracies on the test set are shown in Table 2. The results show that CANets outperformed other real-time methods with faster speed on the Cityscape dataset. Next, we ...

work page
[10]

We design a two-branch network to extract high-level contextual features and encode low-level spatial information simultaneously

CONCLUSIONS This paper presents a new Cross Attention Network (CANet) for semantic segmentation. We design a two-branch network to extract high-level contextual features and encode low-level spatial information simultaneously. In the context branch, lightweight networks are employed to reduce computational cost, a Feature Cross Attention (FCA) module is p...

work page
[11]

Fully convolu- tional networks for semantic segmentation,

J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- tional networks for semantic segmentation,” in CVPR, 2015

work page 2015
[12]

Multi-Scale Context Aggregation by Dilated Convolutions

F. Yu and V . Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv:1511.07122, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[13]

Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” PAMI, 2018

work page 2018
[14]

U-net: Convo- lutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convo- lutional networks for biomedical image segmentation,” in MICCAI, 2015

work page 2015
[15]

Bisenet: Bilateral segmentation network for real-time semantic segmentation,

C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in ECCV, 2018

work page 2018
[16]

Icnet for real-time semantic segmentation on high-resolution im- ages,

H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution im- ages,” in ECCV, 2018

work page 2018
[17]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. En- zweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016

work page 2016
[18]

Segmentation and recognition using structure from mo- tion point clouds,

G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from mo- tion point clouds,” in ECCV, 2008

work page 2008
[19]

Rethinking Atrous Convolution for Semantic Image Segmentation

L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Re- thinking atrous convolution for semantic image segmen- tation,” arXiv:1706.05587, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,

S. Jegou, M. Drozdzal, D. Vazquez, A. Romero, and Y . Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in CVPRW, 2017

work page 2017
[21]

Reﬁnenet: Multi-path reﬁnement networks for high-resolution se- mantic segmentation.,

G. Lin, A. Milan, C. Shen, and I. D. Reid, “Reﬁnenet: Multi-path reﬁnement networks for high-resolution se- mantic segmentation.,” in CVPR, 2017

work page 2017
[22]

Pyramid scene parsing network,

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017

work page 2017
[23]

Understanding convolution for seman- tic segmentation,

P. Wang, P. Chen, Y . Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for seman- tic segmentation,” in WACV, 2018

work page 2018
[24]

Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks

A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and channel squeeze & excitation in fully convo- lutional networks,” arXiv:1803.02579, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Context encoding for semantic seg- mentation,

H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic seg- mentation,” in CVPR, 2018

work page 2018
[26]

Pyramid attention network for semantic segmentation,

H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” in BMVC, 2018

work page 2018
[27]

Dual Attention Network for Scene Segmentation

J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” arXiv:1809.02983, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Mobilenetv2: Inverted residuals and linear bottlenecks,

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in CVPR, 2018

work page 2018
[29]

Batch normalization: Acceler- ating deep network training by reducing internal covari- ate shift,

S. Ioffe and C. Szegedy, “Batch normalization: Acceler- ating deep network training by reducing internal covari- ate shift,” in ICML, 2015

work page 2015
[30]

Imagenet classiﬁcation with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classiﬁcation with deep convolutional neural networks,” in NIPS, 2012

work page 2012
[31]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016

work page 2016
[32]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[33]

Erfnet: Efﬁcient residual factorized convnet for real-time semantic segmentation,

E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Ar- royo, “Erfnet: Efﬁcient residual factorized convnet for real-time semantic segmentation,” ITSS, 2018

work page 2018
[34]

Automatic differentiation in pytorch,

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017

work page 2017
[35]

Video scene parsing with predictive feature learning,

X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y . Chen, J. Dong, L. Liu, Z. Jie, et al., “Video scene parsing with predictive feature learning,” in ICCV, 2017

work page 2017
[36]

Scale- adaptive convolutions for scene parsing,

R. Zhang, S. Tang, Y . Zhang, J. Li, and S. Yan, “Scale- adaptive convolutions for scene parsing,” in ICCV, 2017

work page 2017
[37]

Recurrent scene parsing with perspective understanding in the loop,

S. Kong and C. Fowlkes, “Recurrent scene parsing with perspective understanding in the loop,” in CVPR, 2018

work page 2018
[38]

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

A. Kendall, V . Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convo- lutional encoder-decoder architectures for scene under- standing,” arXiv:1511.02680, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[39]

Feature space op- timization for semantic video segmentation,

A. Kundu, V . Vineet, and V . Koltun, “Feature space op- timization for semantic video segmentation,” in CVPR, 2016

work page 2016

[1] [1]

It has wide applications in satellite imagery analysis, medical im- age diagnostics, indoor scene understanding, etc

INTRODUCTION Semantic segmentation, which assigns class labels to image pixels, is a fundamental problem in computer vision. It has wide applications in satellite imagery analysis, medical im- age diagnostics, indoor scene understanding, etc. Signiﬁcant progress has been made recently with the Fully Convolutional Network (FCN) [1], replacing the fully con...

work page

[2] [2]

and CamVid [8], with fewer parameters and faster speed compared to the existing methods

work page

[3] [3]

Long et al

RELA TED WORK Semantic segmentation. Long et al . proposed the FCN

work page

[4] [4]

Cross Attention Network for Semantic Segmentation

to take arbitrary sized input and produce corresponding segmentation map. Skipping connections were introduced to combine coarse and ﬁne predictions to obtain denser feature maps. Chen et al. proposed a series of segmentation networks called DeepLab [3, 9] and employed atrous convolutions to increase the ﬁeld of view and maintain the spatial resolution wi...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[5] [5]

In [3], an Atrous Spatial Pyramid Pooling module was embedded at the end of the network to capture multi-scale information

performs multi-scale spatial pooling at the ﬁnal feature maps to capture global features. In [3], an Atrous Spatial Pyramid Pooling module was embedded at the end of the network to capture multi-scale information. Recent research showed that preserving spatial details helped achieve good re- sults [12, 6]. DeepLab [3], PSPNet and DUC [13] employ atrous co...

work page

[6] [6]

Two branches Based on the existing methods, the two-branch architecture can encode spatial information and extract deep contextual features

METHODS 3.1. Two branches Based on the existing methods, the two-branch architecture can encode spatial information and extract deep contextual features. In this architecture, a shallow branch is designed for preserving spatial information and a deep network is em- ployed for capturing context. In the proposed CANet, the spatial branch only consists of th...

work page

[7] [7]

Mo- bileNetV2 builds upon the idea of depthwise separable con- volutions and can be used as a powerful feature extractor with high efﬁciency

is employed as the backbone of the context path. Mo- bileNetV2 builds upon the idea of depthwise separable con- volutions and can be used as a powerful feature extractor with high efﬁciency. In the context branch, the ﬁnal convolutional layer of MobileNetV2 is discarded. The features of the last two stages are upsampled by deconvolutions and concate- nate...

work page

[8] [8]

We ﬁrst introduce the implementation protocol Fig

EXPERIMENTAL RESULTS We evaluated our CANets on two benchmark datasets: the CamVid road scenes dataset and the urban scene dataset Cityscapes. We ﬁrst introduce the implementation protocol Fig. 2: Architecture of the Cross Attention Network. and conduct ablation studies on the Cityscapes validation dataset, and ﬁnally we report the results on Cityscapes a...

work page 2018

[9] [9]

The results are shown in Table 2

with CUDA 10.0, and each network was randomly ini- tialized and evaluated for 100 times. The results are shown in Table 2. Next, we trained and evaluated CANet1 and CANet2 at 1024 ×512 and accuracies on the test set are shown in Table 2. The results show that CANets outperformed other real-time methods with faster speed on the Cityscape dataset. Next, we ...

work page

[10] [10]

We design a two-branch network to extract high-level contextual features and encode low-level spatial information simultaneously

CONCLUSIONS This paper presents a new Cross Attention Network (CANet) for semantic segmentation. We design a two-branch network to extract high-level contextual features and encode low-level spatial information simultaneously. In the context branch, lightweight networks are employed to reduce computational cost, a Feature Cross Attention (FCA) module is p...

work page

[11] [11]

Fully convolu- tional networks for semantic segmentation,

J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- tional networks for semantic segmentation,” in CVPR, 2015

work page 2015

[12] [12]

Multi-Scale Context Aggregation by Dilated Convolutions

F. Yu and V . Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv:1511.07122, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[13] [13]

Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” PAMI, 2018

work page 2018

[14] [14]

U-net: Convo- lutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convo- lutional networks for biomedical image segmentation,” in MICCAI, 2015

work page 2015

[15] [15]

Bisenet: Bilateral segmentation network for real-time semantic segmentation,

C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in ECCV, 2018

work page 2018

[16] [16]

Icnet for real-time semantic segmentation on high-resolution im- ages,

H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution im- ages,” in ECCV, 2018

work page 2018

[17] [17]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. En- zweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016

work page 2016

[18] [18]

Segmentation and recognition using structure from mo- tion point clouds,

G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from mo- tion point clouds,” in ECCV, 2008

work page 2008

[19] [19]

Rethinking Atrous Convolution for Semantic Image Segmentation

L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Re- thinking atrous convolution for semantic image segmen- tation,” arXiv:1706.05587, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,

S. Jegou, M. Drozdzal, D. Vazquez, A. Romero, and Y . Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in CVPRW, 2017

work page 2017

[21] [21]

Reﬁnenet: Multi-path reﬁnement networks for high-resolution se- mantic segmentation.,

G. Lin, A. Milan, C. Shen, and I. D. Reid, “Reﬁnenet: Multi-path reﬁnement networks for high-resolution se- mantic segmentation.,” in CVPR, 2017

work page 2017

[22] [22]

Pyramid scene parsing network,

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017

work page 2017

[23] [23]

Understanding convolution for seman- tic segmentation,

P. Wang, P. Chen, Y . Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for seman- tic segmentation,” in WACV, 2018

work page 2018

[24] [24]

Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks

A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and channel squeeze & excitation in fully convo- lutional networks,” arXiv:1803.02579, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[25] [25]

Context encoding for semantic seg- mentation,

H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic seg- mentation,” in CVPR, 2018

work page 2018

[26] [26]

Pyramid attention network for semantic segmentation,

H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” in BMVC, 2018

work page 2018

[27] [27]

Dual Attention Network for Scene Segmentation

J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” arXiv:1809.02983, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[28] [28]

Mobilenetv2: Inverted residuals and linear bottlenecks,

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in CVPR, 2018

work page 2018

[29] [29]

Batch normalization: Acceler- ating deep network training by reducing internal covari- ate shift,

S. Ioffe and C. Szegedy, “Batch normalization: Acceler- ating deep network training by reducing internal covari- ate shift,” in ICML, 2015

work page 2015

[30] [30]

Imagenet classiﬁcation with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classiﬁcation with deep convolutional neural networks,” in NIPS, 2012

work page 2012

[31] [31]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016

work page 2016

[32] [32]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[33] [33]

Erfnet: Efﬁcient residual factorized convnet for real-time semantic segmentation,

E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Ar- royo, “Erfnet: Efﬁcient residual factorized convnet for real-time semantic segmentation,” ITSS, 2018

work page 2018

[34] [34]

Automatic differentiation in pytorch,

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017

work page 2017

[35] [35]

Video scene parsing with predictive feature learning,

X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y . Chen, J. Dong, L. Liu, Z. Jie, et al., “Video scene parsing with predictive feature learning,” in ICCV, 2017

work page 2017

[36] [36]

Scale- adaptive convolutions for scene parsing,

R. Zhang, S. Tang, Y . Zhang, J. Li, and S. Yan, “Scale- adaptive convolutions for scene parsing,” in ICCV, 2017

work page 2017

[37] [37]

Recurrent scene parsing with perspective understanding in the loop,

S. Kong and C. Fowlkes, “Recurrent scene parsing with perspective understanding in the loop,” in CVPR, 2018

work page 2018

[38] [38]

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

A. Kendall, V . Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convo- lutional encoder-decoder architectures for scene under- standing,” arXiv:1511.02680, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[39] [39]

Feature space op- timization for semantic video segmentation,

A. Kundu, V . Vineet, and V . Koltun, “Feature space op- timization for semantic video segmentation,” in CVPR, 2016

work page 2016