pith. sign in

arxiv: 1907.10958 · v1 · pith:VZQZYQ4Rnew · submitted 2019-07-25 · 💻 cs.CV

Cross Attention Network for Semantic Segmentation

Pith reviewed 2026-05-24 16:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic segmentationcross attentionfeature fusionreal-time segmentationCityscapesCamVidattention mechanismslightweight networks
0
0 comments X

The pith

A cross-attention module that pulls spatial attention from a shallow branch and channel attention from a deep branch improves semantic segmentation accuracy and speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a Cross Attention Network with two branches and a Feature Cross Attention module to handle semantic segmentation. One shallow branch preserves low-level spatial details while the other deep branch extracts high-level contextual information. The module computes a spatial attention map from one branch and a channel attention map from the other, then fuses them so that context guides the overall features and spatial cues sharpen local boundaries. Experiments show this design beats competing real-time approaches on Cityscapes and CamVid when paired with lightweight backbones and reaches state-of-the-art numbers with deeper backbones.

Core claim

The central claim is that the Feature Cross Attention module produces superior fused representations by separately deriving a spatial attention map from the shallow branch and a channel attention map from the deep branch before combining them, allowing contextual features to supply global guidance while spatial features refine localizations.

What carries the argument

The Feature Cross Attention (FCA) module, which derives a spatial attention map from one branch and a channel attention map from the other before fusing the two.

If this is right

  • The network outperforms other real-time methods with improved speed on Cityscapes and CamVid when using lightweight backbones.
  • It reaches state-of-the-art performance on the same datasets when a deep backbone is used.
  • Contextual features from the deep branch supply global guidance to the fused maps while spatial features from the shallow branch refine localizations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cross-branch attention pattern could be tested on other dense prediction tasks such as depth estimation to check whether the same fusion benefit appears.
  • Because the module keeps branches separate until the final fusion step, it may scale to higher-resolution inputs with less memory growth than joint attention designs.
  • The reported speed gains suggest the architecture could support real-time video segmentation without additional hardware-specific optimizations.

Load-bearing premise

That separately deriving and then fusing a spatial attention map from one branch with a channel attention map from the other yields a better combined representation than ordinary feature concatenation or existing attention blocks.

What would settle it

An ablation study on Cityscapes in which the FCA module is replaced by simple concatenation or a standard attention block and segmentation accuracy stays the same or improves.

read the original abstract

In this paper, we address the semantic segmentation task with a deep network that combines contextual features and spatial information. The proposed Cross Attention Network is composed of two branches and a Feature Cross Attention (FCA) module. Specifically, a shallow branch is used to preserve low-level spatial information and a deep branch is employed to extract high-level contextual features. Then the FCA module is introduced to combine these two branches. Different from most existing attention mechanisms, the FCA module obtains spatial attention map and channel attention map from two branches separately, and then fuses them. The contextual features are used to provide global contextual guidance in fused feature maps, and spatial features are used to refine localizations. The proposed network outperforms other real-time methods with improved speed on the Cityscapes and CamVid datasets with lightweight backbones, and achieves state-of-the-art performance with a deep backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Cross Attention Network (CAN) for semantic segmentation. It consists of a shallow branch to preserve low-level spatial information, a deep branch to extract high-level contextual features, and a Feature Cross Attention (FCA) module that derives a spatial attention map from one branch and a channel attention map from the other before fusing them. The contextual features provide global guidance while spatial features refine localizations. The authors claim that CAN outperforms other real-time methods with improved speed on Cityscapes and CamVid using lightweight backbones and achieves state-of-the-art results with a deep backbone.

Significance. If the reported gains hold under rigorous controls and the FCA module's specific cross-branch design is shown to be responsible for the improvements rather than the dual-branch architecture alone, the work would offer a targeted contribution to efficient attention-based fusion in semantic segmentation. The approach extends existing attention mechanisms in a structured way that could influence real-time model design on standard benchmarks.

major comments (2)
  1. [Experiments section (Tables reporting Cityscapes/CamVid results)] The central empirical claim rests on the FCA module outperforming standard fusion, yet the experiments provide no ablation that directly compares the proposed cross-attention fusion (spatial map from shallow branch + channel map from deep branch) against direct concatenation of the two branches or against established attention blocks such as SENet or CBAM. Without this isolation, the headline attribution to FCA cannot be evaluated.
  2. [Section 3 (FCA module description) and corresponding results tables] The abstract and method description assert that the fused maps supply 'global contextual guidance' and 'refine localizations,' but no quantitative breakdown (e.g., per-class IoU deltas or feature-map visualizations) demonstrates that these benefits arise specifically from the cross-branch attention derivation rather than from the dual-branch backbone itself.
minor comments (1)
  1. [Abstract] The abstract supplies no numerical results, metrics, or baseline comparisons, which is atypical for an empirical computer-vision paper and forces the reader to reach the experiments section before any assessment of the claims is possible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on isolating the contribution of the FCA module. We address the two major comments below and will update the manuscript with additional experiments and analyses.

read point-by-point responses
  1. Referee: [Experiments section (Tables reporting Cityscapes/CamVid results)] The central empirical claim rests on the FCA module outperforming standard fusion, yet the experiments provide no ablation that directly compares the proposed cross-attention fusion (spatial map from shallow branch + channel map from deep branch) against direct concatenation of the two branches or against established attention blocks such as SENet or CBAM. Without this isolation, the headline attribution to FCA cannot be evaluated.

    Authors: We agree that the current experiments do not include a direct ablation isolating the cross-attention fusion against concatenation or blocks such as SENet and CBAM. While the manuscript reports gains over competing real-time methods on Cityscapes and CamVid, these do not fully separate the FCA design from the dual-branch architecture. We will add the requested ablation studies in the revised manuscript. revision: yes

  2. Referee: [Section 3 (FCA module description) and corresponding results tables] The abstract and method description assert that the fused maps supply 'global contextual guidance' and 'refine localizations,' but no quantitative breakdown (e.g., per-class IoU deltas or feature-map visualizations) demonstrates that these benefits arise specifically from the cross-branch attention derivation rather than from the dual-branch backbone itself.

    Authors: We acknowledge that the manuscript lacks per-class IoU breakdowns and feature-map visualizations that would more directly attribute the claimed guidance and localization benefits to the cross-branch attention rather than the dual-branch structure alone. We will incorporate these quantitative and visual analyses in the revision to strengthen the evidence. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on public dataset benchmarks

full rationale

The provided abstract and description contain no equations, fitted parameters, or derivations. The FCA module is described architecturally (separate spatial/channel attention maps fused across branches) without any self-referential definitions or predictions that reduce to inputs. Performance claims are external evaluations on Cityscapes and CamVid; no self-citation chains or ansatzes are invoked in the given text. This is a standard non-circular empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5664 in / 1014 out tokens · 25304 ms · 2026-05-24T16:10:14.339250+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

  1. [1]

    It has wide applications in satellite imagery analysis, medical im- age diagnostics, indoor scene understanding, etc

    INTRODUCTION Semantic segmentation, which assigns class labels to image pixels, is a fundamental problem in computer vision. It has wide applications in satellite imagery analysis, medical im- age diagnostics, indoor scene understanding, etc. Significant progress has been made recently with the Fully Convolutional Network (FCN) [1], replacing the fully con...

  2. [2]

    and CamVid [8], with fewer parameters and faster speed compared to the existing methods

  3. [3]

    Long et al

    RELA TED WORK Semantic segmentation. Long et al . proposed the FCN

  4. [4]

    Cross Attention Network for Semantic Segmentation

    to take arbitrary sized input and produce corresponding segmentation map. Skipping connections were introduced to combine coarse and fine predictions to obtain denser feature maps. Chen et al. proposed a series of segmentation networks called DeepLab [3, 9] and employed atrous convolutions to increase the field of view and maintain the spatial resolution wi...

  5. [5]

    In [3], an Atrous Spatial Pyramid Pooling module was embedded at the end of the network to capture multi-scale information

    performs multi-scale spatial pooling at the final feature maps to capture global features. In [3], an Atrous Spatial Pyramid Pooling module was embedded at the end of the network to capture multi-scale information. Recent research showed that preserving spatial details helped achieve good re- sults [12, 6]. DeepLab [3], PSPNet and DUC [13] employ atrous co...

  6. [6]

    Two branches Based on the existing methods, the two-branch architecture can encode spatial information and extract deep contextual features

    METHODS 3.1. Two branches Based on the existing methods, the two-branch architecture can encode spatial information and extract deep contextual features. In this architecture, a shallow branch is designed for preserving spatial information and a deep network is em- ployed for capturing context. In the proposed CANet, the spatial branch only consists of th...

  7. [7]

    Mo- bileNetV2 builds upon the idea of depthwise separable con- volutions and can be used as a powerful feature extractor with high efficiency

    is employed as the backbone of the context path. Mo- bileNetV2 builds upon the idea of depthwise separable con- volutions and can be used as a powerful feature extractor with high efficiency. In the context branch, the final convolutional layer of MobileNetV2 is discarded. The features of the last two stages are upsampled by deconvolutions and concate- nate...

  8. [8]

    We first introduce the implementation protocol Fig

    EXPERIMENTAL RESULTS We evaluated our CANets on two benchmark datasets: the CamVid road scenes dataset and the urban scene dataset Cityscapes. We first introduce the implementation protocol Fig. 2: Architecture of the Cross Attention Network. and conduct ablation studies on the Cityscapes validation dataset, and finally we report the results on Cityscapes a...

  9. [9]

    The results are shown in Table 2

    with CUDA 10.0, and each network was randomly ini- tialized and evaluated for 100 times. The results are shown in Table 2. Next, we trained and evaluated CANet1 and CANet2 at 1024 ×512 and accuracies on the test set are shown in Table 2. The results show that CANets outperformed other real-time methods with faster speed on the Cityscape dataset. Next, we ...

  10. [10]

    We design a two-branch network to extract high-level contextual features and encode low-level spatial information simultaneously

    CONCLUSIONS This paper presents a new Cross Attention Network (CANet) for semantic segmentation. We design a two-branch network to extract high-level contextual features and encode low-level spatial information simultaneously. In the context branch, lightweight networks are employed to reduce computational cost, a Feature Cross Attention (FCA) module is p...

  11. [11]

    Fully convolu- tional networks for semantic segmentation,

    J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- tional networks for semantic segmentation,” in CVPR, 2015

  12. [12]

    Multi-Scale Context Aggregation by Dilated Convolutions

    F. Yu and V . Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv:1511.07122, 2015

  13. [13]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

    L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” PAMI, 2018

  14. [14]

    U-net: Convo- lutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convo- lutional networks for biomedical image segmentation,” in MICCAI, 2015

  15. [15]

    Bisenet: Bilateral segmentation network for real-time semantic segmentation,

    C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in ECCV, 2018

  16. [16]

    Icnet for real-time semantic segmentation on high-resolution im- ages,

    H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution im- ages,” in ECCV, 2018

  17. [17]

    The cityscapes dataset for semantic urban scene understanding,

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. En- zweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016

  18. [18]

    Segmentation and recognition using structure from mo- tion point clouds,

    G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from mo- tion point clouds,” in ECCV, 2008

  19. [19]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Re- thinking atrous convolution for semantic image segmen- tation,” arXiv:1706.05587, 2017

  20. [20]

    The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,

    S. Jegou, M. Drozdzal, D. Vazquez, A. Romero, and Y . Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in CVPRW, 2017

  21. [21]

    Refinenet: Multi-path refinement networks for high-resolution se- mantic segmentation.,

    G. Lin, A. Milan, C. Shen, and I. D. Reid, “Refinenet: Multi-path refinement networks for high-resolution se- mantic segmentation.,” in CVPR, 2017

  22. [22]

    Pyramid scene parsing network,

    H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017

  23. [23]

    Understanding convolution for seman- tic segmentation,

    P. Wang, P. Chen, Y . Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for seman- tic segmentation,” in WACV, 2018

  24. [24]

    Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks

    A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and channel squeeze & excitation in fully convo- lutional networks,” arXiv:1803.02579, 2018

  25. [25]

    Context encoding for semantic seg- mentation,

    H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic seg- mentation,” in CVPR, 2018

  26. [26]

    Pyramid attention network for semantic segmentation,

    H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” in BMVC, 2018

  27. [27]

    Dual Attention Network for Scene Segmentation

    J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” arXiv:1809.02983, 2018

  28. [28]

    Mobilenetv2: Inverted residuals and linear bottlenecks,

    M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in CVPR, 2018

  29. [29]

    Batch normalization: Acceler- ating deep network training by reducing internal covari- ate shift,

    S. Ioffe and C. Szegedy, “Batch normalization: Acceler- ating deep network training by reducing internal covari- ate shift,” in ICML, 2015

  30. [30]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012

  31. [31]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016

  32. [32]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014

  33. [33]

    Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,

    E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Ar- royo, “Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,” ITSS, 2018

  34. [34]

    Automatic differentiation in pytorch,

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017

  35. [35]

    Video scene parsing with predictive feature learning,

    X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y . Chen, J. Dong, L. Liu, Z. Jie, et al., “Video scene parsing with predictive feature learning,” in ICCV, 2017

  36. [36]

    Scale- adaptive convolutions for scene parsing,

    R. Zhang, S. Tang, Y . Zhang, J. Li, and S. Yan, “Scale- adaptive convolutions for scene parsing,” in ICCV, 2017

  37. [37]

    Recurrent scene parsing with perspective understanding in the loop,

    S. Kong and C. Fowlkes, “Recurrent scene parsing with perspective understanding in the loop,” in CVPR, 2018

  38. [38]

    Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

    A. Kendall, V . Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convo- lutional encoder-decoder architectures for scene under- standing,” arXiv:1511.02680, 2015

  39. [39]

    Feature space op- timization for semantic video segmentation,

    A. Kundu, V . Vineet, and V . Koltun, “Feature space op- timization for semantic video segmentation,” in CVPR, 2016