arxiv: 1907.05740 · v1 · pith:ZRSI7K4Tnew · submitted 2019-07-12 · 💻 cs.CV · cs.LG

Gated-SCNN: Gated Shape CNNs for Semantic Segmentation

Pith reviewed 2026-05-24 22:18 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords semantic segmentation · shape stream · gated CNN · Cityscapes benchmark · boundary quality · two-stream architecture · object boundaries · deep learning

The pith

A two-stream CNN dedicates one branch to shape information and gates it with activations from the main color-texture stream to sharpen object boundaries in semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes splitting semantic segmentation into a classical stream that processes color, shape, and texture together and a parallel shape stream that focuses only on boundary information. Higher-level features from the classical stream gate the lower-level activations in the shape stream, which removes noise and lets the shape stream run at full image resolution with a shallow network. This produces sharper boundary predictions and improves results especially on thin or small objects. The architecture reaches state-of-the-art mask and boundary scores on the Cityscapes benchmark.

Core claim

The gated shape stream, wired in parallel to the classical stream and controlled by higher-level activations from that stream, lets the network process boundary information separately at image resolution. This yields sharper predictions around object boundaries and lifts mask mIoU and boundary F-score on Cityscapes by 2% and 4%, respectively, over strong baselines.

What carries the argument

Gates that use higher-level classical-stream activations to modulate lower-level shape-stream activations, removing noise so the shape stream focuses only on relevant boundary cues.
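
To make the gating idea concrete, here is a minimal NumPy sketch. This page does not state the gate's formula, so the form used here is an assumption: a sigmoid applied to a single-channel 1x1 convolution over the concatenated features of both streams. The names `gated_shape_features`, `gate_weights`, and `gate_bias` are hypothetical, and the classical-stream features are assumed to be already resized to the shape stream's resolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_shape_features(shape_feat, regular_feat, gate_weights, gate_bias=0.0):
    """Gate shape-stream activations with classical-stream activations.

    shape_feat:   (C_s, H, W) lower-level shape-stream activations
    regular_feat: (C_r, H, W) higher-level classical-stream activations,
                  assumed already resized to the same H x W
    gate_weights: (C_s + C_r,) weights of a hypothetical 1x1 convolution
    """
    stacked = np.concatenate([shape_feat, regular_feat], axis=0)
    # A 1x1 convolution with one output channel is a per-pixel dot product.
    logits = np.tensordot(gate_weights, stacked, axes=([0], [0])) + gate_bias
    gate = sigmoid(logits)                      # (H, W), values in (0, 1)
    # Broadcast the single-channel gate over all shape-stream channels.
    return shape_feat * gate[None, :, :], gate
```

Pixels where the gate is close to zero are suppressed in the shape stream, which is one way to read the "removing noise" claim. The layer in the paper may differ in detail, for example by adding a residual connection.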

If this is right

  • Sharper boundary predictions around object edges
  • Better accuracy on thinner and smaller objects
  • A very shallow shape stream suffices when operated at full image resolution
  • Joint improvement in both mask mIoU and boundary F-score metrics
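
For readers unfamiliar with the mask metric named above, the following is a toy, single-image sketch of mean IoU. It is an illustration only: the function name `mean_iou` is hypothetical, and benchmark evaluations typically accumulate counts over a whole dataset rather than averaging per image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skipped by convention here
        ious.append(np.logical_and(p, t).sum() / union)
    # Return NaN when no class is present at all.
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)  # (1/2 + 2/3) / 2
```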

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gating idea could be tested on other dense prediction tasks that benefit from explicit boundary focus, such as instance segmentation or depth estimation.
  • Because the shape stream stays shallow, the added compute cost remains modest, suggesting the approach may scale to higher-resolution inputs without proportional slowdown.
  • If the gating proves robust across datasets, it could reduce the need for post-processing steps that refine boundaries after the main network runs.

Load-bearing premise

Higher-level features from the main stream contain enough clean information to gate the shape stream without discarding useful boundary signals.

What would settle it

Running the shape stream without the gates and measuring whether boundary F-score and thin-object accuracy still improve over the single-stream baseline.
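
The ablation above needs a boundary metric. The sketch below shows one generic way to compute a boundary F-score with a pixel tolerance, using only NumPy. The paper's exact evaluation protocol is not given on this page, so the square tolerance window and the names `dilate` and `boundary_f_score` are assumptions made for illustration.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2r+1) x (2r+1) square window."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def boundary_f_score(pred_boundary, gt_boundary, tolerance=1):
    """F-score between two binary boundary maps with a pixel tolerance."""
    pred = np.asarray(pred_boundary, dtype=bool)
    gt = np.asarray(gt_boundary, dtype=bool)
    if pred.sum() == 0 and gt.sum() == 0:
        return 1.0  # nothing to find and nothing predicted
    if pred.sum() == 0 or gt.sum() == 0:
        return 0.0
    # A predicted pixel counts as correct if a true boundary pixel is nearby.
    precision = (pred & dilate(gt, tolerance)).sum() / pred.sum()
    # A true pixel counts as recovered if a predicted pixel is nearby.
    recall = (gt & dilate(pred, tolerance)).sum() / gt.sum()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

Comparing this score for a gated and an ungated shape stream, each against the single-stream baseline, is the kind of measurement the test above calls for.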

Figures

Figures reproduced from arXiv: 1907.05740 by David Acuna, Sanja Fidler, Towaki Takikawa, Varun Jampani.

Figure 1: We introduce Gated-SCNN (GSCNN), a new two-stream CNN architecture for semantic segmentation that explicitly wires shape information as a separate processing stream. GSCNN uses a new gating mechanism to connect the intermediate layers. Fusion of information between streams is done at the very end through a fusion module. To predict high-quality boundaries, we exploit a new loss function that encourages t…
Figure 2: GSCNN architecture. Our architecture constitutes of two main streams. The regular stream and the shape stream. The regular stream can be any backbone architecture. The shape stream focuses on shape processing through a set of residual blocks, Gated Convolutional Layers (GCL) and supervision. A fusion module later combines information from the two streams in a multi-scale fashion using an Atrous Spatial Pyr…
Figure 3: Illustration of the crops used for the distance-based evaluation
Figure 6: Example output of shape stream fed into the fusion module.
Figure 7: Qualitative results of our method on the Cityscapes
Figure 8: Qualitative comparison in terms of errors in predictions. Notice that our method produces more precise boundaries, particularly for smaller and thinner objects such as poles. Boundaries around people are also sharper.
[Table residue, truncated in the source: per-class results with columns Method, Coarse, road, s.walk, build., wall, fence, pole, t-light, t-sign, veg, terrain, sky, person, rider, car, truck, bus, train, motor, bike, mean; first row PSP-Net [58].]
Figure 9: Qualitative results on the Cityscapes test set showing the high-quality boundaries of our predicted segmentation masks. Boundaries are obtained
Figure 10: Visualization of the alpha channels from the GCLs.
Original abstract

Current state-of-the-art methods for image segmentation form a dense image representation where the color, shape and texture information are all processed together inside a deep CNN. This however may not be ideal as they contain very different type of information relevant for recognition. Here, we propose a new two-stream CNN architecture for semantic segmentation that explicitly wires shape information as a separate processing branch, i.e. shape stream, that processes information in parallel to the classical stream. Key to this architecture is a new type of gates that connect the intermediate layers of the two streams. Specifically, we use the higher-level activations in the classical stream to gate the lower-level activations in the shape stream, effectively removing noise and helping the shape stream to only focus on processing the relevant boundary-related information. This enables us to use a very shallow architecture for the shape stream that operates on the image-level resolution. Our experiments show that this leads to a highly effective architecture that produces sharper predictions around object boundaries and significantly boosts performance on thinner and smaller objects. Our method achieves state-of-the-art performance on the Cityscapes benchmark, in terms of both mask (mIoU) and boundary (F-score) quality, improving by 2% and 4% over strong baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Gated-SCNN, a two-stream CNN for semantic segmentation consisting of a classical stream processing color/texture and a parallel shape stream. Gates connect the streams such that higher-level activations from the classical stream gate lower-level activations in the shape stream to suppress noise and focus on boundary information. This allows a shallow shape stream at full resolution. Experiments claim state-of-the-art results on Cityscapes, with +2% mIoU and +4% boundary F-score over strong baselines, plus sharper predictions on thin/small objects.

Significance. If the empirical gains hold under controlled ablations, the explicit separation of shape processing with learned cross-stream gating offers a practical architectural motif for boundary-sensitive segmentation. The shallow shape stream is an efficiency advantage worth noting.

major comments (2)
  1. [Abstract] Abstract: the claim that the gates 'effectively remov[e] noise' and are 'key to this architecture' is load-bearing for attributing the +2% mIoU / +4% F-score gains to the gating mechanism rather than to the mere addition of a second stream; no ablation isolating the gates versus an ungated shape stream is referenced, leaving the causal contribution unverified.
  2. [Abstract] Abstract (results paragraph): the SOTA claim rests on specific numerical improvements, yet the manuscript provides no indication of whether the strong baselines share the same backbone, training schedule, or data augmentation as the proposed model; without these controls the 2%/4% deltas cannot be confidently ascribed to the architectural innovation.
minor comments (1)
  1. [Abstract] The abstract states the shape stream 'operates on the image-level resolution' but supplies no diagram or equation showing how the gating operation is implemented at that resolution (e.g., spatial alignment, channel dimensions).
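
The alignment question raised in this comment can be illustrated with a small NumPy sketch. It is not taken from the paper: the helper names `conv1x1` and `upsample_nearest` are hypothetical, the channel counts and the factor of 4 are arbitrary, and nearest-neighbor upsampling is used only to keep the example dependency-free (a real implementation would more likely interpolate bilinearly).

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution: weight (C_out, C_in) applied at every pixel of (C_in, h, w)."""
    return np.einsum("oc,chw->ohw", weight, feat)

def upsample_nearest(feat, factor):
    """(C, h, w) -> (C, h*factor, w*factor) by repeating each pixel."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

# Hypothetical low-resolution classical-stream feature map: 2 channels, 2 x 2.
regular_feat = np.arange(8, dtype=float).reshape(2, 2, 2)
# Reduce to one channel, then bring it to the shape stream's 8 x 8 grid.
reduced = conv1x1(regular_feat, np.ones((1, 2)))
aligned = upsample_nearest(reduced, factor=4)
```

After these two steps the classical-stream signal has the same spatial size as the full-resolution shape features, so a per-pixel gate can be computed from both.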

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will revise the manuscript to improve clarity on ablations and experimental controls.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the gates 'effectively remov[e] noise' and are 'key to this architecture' is load-bearing for attributing the +2% mIoU / +4% F-score gains to the gating mechanism rather than to the mere addition of a second stream; no ablation isolating the gates versus an ungated shape stream is referenced, leaving the causal contribution unverified.

    Authors: The full manuscript contains an ablation study (Section 4.3) that directly compares the gated shape stream against an ungated shape stream variant, isolating the contribution of the learned gates to noise suppression and boundary focus. These results support the attribution of gains to the gating mechanism. We will revise the abstract to reference this ablation explicitly.
    revision: yes

  2. Referee: [Abstract] Abstract (results paragraph): the SOTA claim rests on specific numerical improvements, yet the manuscript provides no indication of whether the strong baselines share the same backbone, training schedule, or data augmentation as the proposed model; without these controls the 2%/4% deltas cannot be confidently ascribed to the architectural innovation.

    Authors: The experimental section details that all strong baselines were re-implemented and trained with the same backbone (ResNet-101), training schedule, and data augmentation as Gated-SCNN to ensure a controlled comparison. We will revise the abstract to state this explicitly so the source of the reported deltas is unambiguous.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity

This is an empirical architecture paper proposing a two-stream CNN with cross-stream gating for semantic segmentation. All central claims (sharper boundaries, +2% mIoU and +4% F-score on Cityscapes) rest on benchmark experiments rather than any mathematical derivation, first-principles result, or fitted parameter that is then renamed as a prediction. No equations, ansatzes, uniqueness theorems, or self-citations are load-bearing in the sense of the enumerated circularity patterns; the architecture is presented as a design choice validated externally by standard datasets and metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the empirical performance of a new architecture design validated on standard benchmarks rather than first-principles derivation; the paper introduces the shape stream and gating mechanism as new components.

free parameters (1)
  • Gate parameters
    Parameters of the gating mechanism are learned from training data on the target dataset.
axioms (1)
  • [domain assumption] Shape information is sufficiently distinct from color and texture to benefit from separate parallel processing in CNNs for segmentation.
    The two-stream design is built directly on this separation premise.
invented entities (2)
  • Shape stream (no independent evidence)
    purpose: Dedicated shallow branch for processing boundary-related information at full resolution
    Newly proposed component whose value is demonstrated empirically.
  • Gates between streams (no independent evidence)
    purpose: Mechanism to filter noise in the shape stream using classical stream activations
    Novel connection type introduced in the architecture.

pith-pipeline@v0.9.0 · 5758 in / 1299 out tokens · 28860 ms · 2026-05-24T22:18:34.912999+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 8 internal anchors

  1. [1]

    D. Acuna, A. Kar, and S. Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations. In CVPR,

  2. [2]

    D. Acuna, H. Ling, A. Kar, and S. Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018. 1

  3. [3]

    Bottom-up Instance Segmentation using Deep Higher-Order CRFs

    A. Arnab and P. H. Torr. Bottom-up instance segmentation using deep higher-order crfs. In arXiv:1609.02583, 2016. 2

  4. [4]

    G. Bertasius, J. Shi, and L. Torresani. Semantic segmentation with boundary neural fields. In CVPR, pages 3602–3610,

  5. [5]

    S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. In ECCV, pages 402–418. Springer, 2016. 2

  6. [6]

    L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, pages 4545–4554, 2016. 2

  7. [7]

    L.-C. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NIPS, pages 8713–8724, 2018. 7

  8. [8]

    L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. ICLR, 2015. 2

  9. [9]

    L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. T-PAMI, 40(4):834–848, April 2018. 2, 5

  10. [10]

    L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017. 7

  11. [11]

    L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 1, 2, 3, 5, 6, 7

  12. [12]

    D. Cheng, G. Meng, S. Xiang, and C. Pan. Fusionnet: Edge aware deep convolutional networks for semantic segmentation of remote sensing harbor images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(12):5769–5783, 2017. 2

  13. [13]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 5

  14. [14]

    Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In ICML, pages 933–941. JMLR.org, 2017. 2

  15. [15]

    R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. V. Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, pages 597–613. Springer, 2016. 1, 2

  16. [16]

    E. S. Gastal and M. M. Oliveira. Domain transform for edge-aware image and video processing. In ACM Transactions on Graphics (ToG), volume 30, page 69. ACM, 2011. 2

  17. [17]

    A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012. 1

  18. [18]

    G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, pages 519–534. Springer, 2016. 5

  19. [19]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2, 3

  20. [20]

    X. He and S. Gould. An Exemplar-based CRF for Multi-instance Object Segmentation. In CVPR, 2014. 2

  21. [21]

    G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017. 1

  22. [22]

    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR,

  23. [23]

    V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In CVPR, pages 4452–4461, 2016. 2

  24. [24]

    E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144,

  25. [25]

    T.-W. Ke, J.-J. Hwang, Z. Liu, and S. X. Yu. Adaptive affinity fields for semantic segmentation. In ECCV, pages 587–602,

  26. [26]

    A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491, 2018. 2

  27. [27]

    I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, pages 6129–6138, 2017. 2

  28. [28]

    S. Kong and C. C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. In CVPR, pages 956–965, 2018. 2

  29. [29]

    P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, pages 109–117, 2011. 2

  30. [30]

    D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. CVPR, pages 2136–2143, 2009. 1

  31. [31]

    G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, pages 1925–1934, 2017. 2

  32. [32]

    G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, pages 3194–3203, 2016. 2, 5

  33. [33]

    H. Ling, J. Gao, A. Kar, W. Chen, and S. Fidler. Fast interactive object annotation with curve-gcn. In CVPR, 2019. 1

  34. [34]

    Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

    C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. arXiv preprint arXiv:1901.02985, 2019. 7

  35. [35]

    S. Liu, S. De Mello, J. Gu, G. Zhong, M.-H. Yang, and J. Kautz. Learning affinity via spatial propagation networks. In NIPS, pages 1520–1530, 2017. 1, 2

  36. [36]

    Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, pages 1377–1385, 2015. 2

  37. [37]

    J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In CVPR, 2015. 1, 2

  38. [38]

    J. Malik and D. E. Maydan. Recovering three-dimensional shape from a single image of curved objects. T-PAMI, 11(6):555–566, 1989. 1

  39. [39]

    I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, pages 3994–4003, 2016. 2

  40. [40]

    C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, pages 4353–4361, 2017. 2

  41. [41]

    F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, pages 724–732, 2016. 5

  42. [42]

    T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. CVPR, 2017. 1, 2

  43. [43]

    A. G. Schwing and R. Urtasun. Fully Connected Deep Structured Networks. arXiv:1503.02351, 2015. 2

  44. [44]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3

  45. [45]

    M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. Multinet: Real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1013–1020. IEEE, 2018. 2

  46. [46]

    A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In NIPS, pages 4790–4798, 2016. 2

  47. [47]

    T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018. 1

  48. [48]

    X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, pages 7794–7803, 2018. 2

  49. [49]

    T. Wu, S. Tang, R. Zhang, and J. Li. Tree-structured kronecker convolutional networks for semantic segmentation. arXiv preprint arXiv:1812.04945, 2018. 7

  50. [50]

    S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, pages 1395–1403, 2015. 4

  51. [51]

    F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016. 1

  52. [52]

    F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep layer aggregation. In CVPR, pages 2403–2412, 2018. 1

  53. [53]

    J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018. 2

  54. [54]

    Z. Yu, C. Feng, M.-Y. Liu, and S. Ramalingam. CASENet: Deep category-aware semantic edge detection. In CVPR,

  55. [55]

    Z. Yu, W. Liu, Y. Zou, C. Feng, S. Ramalingam, B. Vijaya Kumar, and J. Kautz. Simultaneous edge alignment and learning. In ECCV, 2018. 5

  56. [56]

    Wide Residual Networks

    S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016. 2, 3

  57. [57]

    Z. Zhang, S. Fidler, and R. Urtasun. Instance-level segmen- tation for autonomous driving with deep densely connected mrfs. In CVPR, 2016. 1

  58. [58]

    H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017. 1, 2, 5, 7

  59. [59]

    S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529–1537, 2015. 2