Pyramid Attention Network for Semantic Segmentation

Hanchao Li; Jie An; Lingxue Wang; Pengfei Xiong

arxiv: 1805.10180 · v3 · pith:YG7XDCCInew · submitted 2018-05-25 · 💻 cs.CV

classification 💻 cs.CV

keywords: attention, pyramid, global, decoder, feature, features, module, network

A Pyramid Attention Network (PAN) is proposed to exploit the impact of global contextual information in semantic segmentation. Unlike most existing works, we combine an attention mechanism with a spatial pyramid to extract precise dense features for pixel labeling, instead of relying on complicated dilated convolutions and artificially designed decoder networks. Specifically, we introduce a Feature Pyramid Attention module, which applies a spatial pyramid attention structure to the high-level output and combines it with global pooling to learn a better feature representation, and a Global Attention Upsample module on each decoder layer, which provides global context as guidance for low-level features to select category localization details. The proposed approach achieves state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes benchmarks, setting a new record of 84.0% mIoU on PASCAL VOC 2012 while training without the COCO dataset.
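For readers unfamiliar with the metric quoted in the abstract, mIoU (mean intersection over union) is commonly computed from a class confusion matrix as the average, over classes, of TP / (TP + FP + FN). The following is a minimal sketch of that standard definition, not the benchmarks' official evaluation code; the function names `confusion_matrix` and `mean_iou`, the choice to skip classes that never appear, and the toy labels are all assumptions made for illustration.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index the ground-truth class, columns the prediction."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix: List[List[int]]) -> float:
    """Average per-class IoU = TP / (TP + FP + FN).

    Assumption for this sketch: classes absent from both the labels and the
    predictions are skipped rather than counted as zero.
    """
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                        # true class c, predicted otherwise
        fp = sum(matrix[r][c] for r in range(n)) - tp   # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical data): six pixels, three classes, flattened label maps.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(confusion_matrix(truth, pred, 3)))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is about 0.5. Benchmark scores are typically accumulated over every pixel of the whole test set before the per-class ratios are taken.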

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Quantum Feature Pyramid Gating for Seismic Image Segmentation
quant-ph 2026-05 conditional novelty 6.0

A 4-qubit quantum feature pyramid gating architecture raises mean IoU from 0.8404 to 0.9389 over classical addition in controlled ablations on the TGS salt segmentation dataset.

Grounding Synthetic Data Generation With Vision and Language Models
cs.CV 2026-03 conditional novelty 5.0

A vision-language grounded framework generates and evaluates synthetic remote sensing data, releasing ARAS400k where augmented training outperforms real-data baselines for segmentation and captioning.