ELKPPNet: An Edge-aware Neural Network with Large Kernel Pyramid Pooling for Learning Discriminative Features in Semantic Segmentation

Hanjiang Xiong; Jianya Gong; Linxi Huan; Xianwei Zheng

arxiv: 1906.11428 · v1 · pith:4HMYKA5Xnew · submitted 2019-06-27 · 💻 cs.CV


Pith reviewed 2026-05-25 15:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords: semantic segmentation, edge-aware loss, large kernel pyramid pooling, encoder-decoder network, multi-scale features, boundary refinement, Cityscapes dataset


The pith

ELKPPNet achieves superior semantic segmentation on Cityscapes, CamVid, and NYUDv2 by pairing a balanced encoder-decoder with large kernel pyramid pooling and an edge-aware loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ELKPPNet as an end-to-end network to solve the problem of insufficient discriminative feature learning in semantic segmentation. It builds a balanced encoder-decoder to reduce gaps between multi-level features, adds a large kernel spatial pyramid pooling block that expands receptive fields for multi-scale fusion, and introduces an edge-aware loss that refines object boundaries straight from the segmentation output. A sympathetic reader would care because the combination targets two practical failures: missing small or large objects and confusing adjacent regions that look alike. Experiments on three standard benchmarks show the full model beats prior methods when conditions are matched.

Core claim

The central claim is that the ELKPPNet architecture, formed by a balanced encoder-decoder network, the LKPP block with densely expanding receptive field, and the new edge-aware loss applied directly to the prediction map, produces more robust and discriminative features that improve both multi-scale object detection and boundary accuracy.

What carries the argument

The large kernel spatial pyramid pooling (LKPP) block that creates a densely expanding receptive field for multi-scale feature extraction and fusion, together with the edge-aware loss that operates directly on the semantic segmentation prediction.

If this is right

Models can handle multi-scale objects more reliably in both urban driving scenes and indoor environments.
Adjacent objects with similar appearance become easier to separate without extra post-processing.
Semantic consistency inside single objects improves because boundary signals feed back into feature learning.
The same loss can be attached to other encoder-decoder backbones to gain boundary refinement without redesigning the whole network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The edge-aware loss could be tested as a plug-in module on existing state-of-the-art segmentation networks to measure isolated gains.
Large-kernel pyramid designs might transfer to other dense-prediction tasks such as depth estimation or surface normal prediction.
Evaluating the model on additional datasets like ADE20K would reveal whether the gains hold beyond the three reported benchmarks.

Load-bearing premise

That the edge-aware loss function refines boundaries directly from the semantic segmentation prediction to yield more robust and discriminative features.

What would settle it

If ELKPPNet fails to exceed the accuracy of the strongest competing methods on the Cityscapes validation set when trained and evaluated under identical conditions and protocols, the superiority claim would be falsified.

Figures

Figures reproduced from arXiv: 1906.11428 by Hanjiang Xiong, Jianya Gong, Linxi Huan, Xianwei Zheng.

**Figure 8.** Zoomed-in qualitative results.

4.4.3 Evaluation of Edge-aware Loss Function. A further evaluation was also made for the proposed edge-aware cross-entropy loss function (also referred to as ECE loss). ResNet-50 with the proposed balanced encoder-decoder framework was applied as the baseline network, and the two loss functions, i.e., CE loss and the proposed ECE loss, were first studied on the baseline network. …

Original abstract

Semantic segmentation has been a hot topic across diverse research fields. Along with the success of deep convolutional neural networks, semantic segmentation has made great achievements and improvements, in terms of both urban scene parsing and indoor semantic segmentation. However, most of the state-of-the-art models are still faced with a challenge in discriminative feature learning, which limits the ability of a model to detect multi-scale objects and to guarantee semantic consistency inside one object or distinguish different adjacent objects with similar appearance. In this paper, a practical and efficient edge-aware neural network is presented for semantic segmentation. This end-to-end trainable engine consists of a new encoder-decoder network, a large kernel spatial pyramid pooling (LKPP) block, and an edge-aware loss function. The encoder-decoder network was designed as a balanced structure to narrow the semantic and resolution gaps in multi-level feature aggregation, while the LKPP block was constructed with a densely expanding receptive field for multi-scale feature extraction and fusion. Furthermore, the new powerful edge-aware loss function is proposed to refine the boundaries directly from the semantic segmentation prediction for more robust and discriminative features. The effectiveness of the proposed model was demonstrated using Cityscapes, CamVid, and NYUDv2 benchmark datasets. The performance of the two structures and the edge-aware loss function in ELKPPNet was validated on the Cityscapes dataset, while the complete ELKPPNet was evaluated on the CamVid and NYUDv2 datasets. A comparative analysis with the state-of-the-art methods under the same conditions confirmed the superiority of the proposed algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ELKPPNet is a practical incremental architecture for semantic segmentation that adds a large-kernel pyramid pooling block and edge-aware loss to a balanced encoder-decoder, with reported gains on Cityscapes, CamVid, and NYUDv2.


The main takeaway is that this paper delivers a working combination of three pieces for semantic segmentation: a balanced encoder-decoder to close feature gaps, the LKPP block for multi-scale extraction, and an edge-aware loss applied straight to the predictions. The authors back it with Cityscapes ablations plus full runs on CamVid and NYUDv2, claiming better numbers than prior methods under matched conditions.

The LKPP design with its expanding receptive fields and the direct edge loss are the concrete new elements, and the ablations give a reasonable check on what each part adds. The cross-dataset tests add external grounding that single-benchmark papers lack. The work stays grounded in public data and standard protocols, with no obvious internal contradictions in the setup.

The soft spots are the usual ones for this kind of paper. The gains are extensions of pyramid pooling and edge supervision ideas rather than a new principle, so the advance is engineering-level. Kernel sizes and loss weights are free parameters that were almost certainly tuned on these same scenes, which limits how much we can read into the superiority claim without more generalization tests. No detailed failure analysis appears.

This is useful for readers who build or adapt segmentation models for driving or robotics and want a ready-to-try block or loss term. It has enough experimental structure and reproducibility potential to go to a serious referee, even if the core ideas are not revolutionary.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes ELKPPNet, an end-to-end trainable encoder-decoder network augmented with a large kernel pyramid pooling (LKPP) block and an edge-aware loss function, for semantic segmentation. It claims that the balanced encoder-decoder narrows semantic and resolution gaps, the LKPP provides densely expanding receptive fields for multi-scale fusion, and the edge-aware loss refines boundaries directly from predictions to yield more discriminative features, resulting in superior performance over state-of-the-art methods on the Cityscapes, CamVid, and NYUDv2 benchmarks under comparable conditions, with component ablations reported on Cityscapes.

Significance. If the reported gains hold, the work supplies a practical architecture combining multi-scale pooling with boundary-aware supervision that could aid urban scene parsing and indoor segmentation tasks. The provision of Cityscapes ablations plus cross-dataset evaluation on two additional benchmarks supplies external grounding for the central empirical claim. The explicit design of the LKPP block and the end-to-end formulation are concrete contributions that can be directly compared by subsequent work.

major comments (2)

[§4.2] (Cityscapes ablation table): the incremental mIoU gains attributed to the edge-aware loss are reported without standard deviations across multiple random seeds or statistical tests; this weakens the claim that the loss produces reliably more robust features, as the observed deltas could fall within run-to-run variance.
[§3.3] (edge-aware loss): the formulation is stated to refine boundaries 'directly from the semantic segmentation prediction,' yet the loss expression incorporates ground-truth edge maps; this mismatch between the textual claim and the actual supervision signal is load-bearing for the interpretation of how discriminative features are learned.
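
The first major comment can be made concrete with a small sketch. The plain-Python snippet below summarizes repeated runs by mean and sample standard deviation, then applies a crude rule of thumb: the mean gain should exceed twice the larger standard deviation. The mIoU values, the function names, and the factor of 2 are illustrative assumptions, not figures or procedures from the paper; a proper analysis would use a statistical test.

```python
from statistics import mean, stdev

def summarize(runs):
    """Return (mean, sample standard deviation) of mIoU values from repeated runs."""
    return mean(runs), stdev(runs)

def delta_exceeds_noise(baseline_runs, variant_runs, factor=2.0):
    """Crude check: is the mean gain larger than `factor` times the larger run-to-run std?"""
    b_mean, b_std = summarize(baseline_runs)
    v_mean, v_std = summarize(variant_runs)
    gain = v_mean - b_mean
    return gain, gain > factor * max(b_std, v_std)

# Hypothetical mIoU values (%), NOT taken from the paper: three seeds per loss.
ce_runs = [70.1, 70.6, 69.8]    # baseline trained with CE loss
ece_runs = [71.0, 70.4, 71.3]   # same network trained with the edge-aware loss

gain, reliable = delta_exceeds_noise(ce_runs, ece_runs)
print(f"mean gain = {gain:.2f} points, exceeds 2x std: {reliable}")
```

With single training runs, as in the reported ablation, `stdev` cannot be computed at all, which is the core of the referee's concern.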

minor comments (3)

[Figure 2] (LKPP block diagram): the kernel sizes and dilation rates inside the pyramid levels are not numerically annotated on the figure itself, forcing the reader to cross-reference the text.
[Table 1] (Cityscapes results): the column headers for 'Params' and 'FPS' are present but the corresponding values for the proposed model are omitted in one row, breaking direct efficiency comparison.
[§5] (NYUDv2 evaluation): the protocol states 'same conditions' as prior work, yet the exact training schedule, crop size, and data augmentation details are only summarized rather than tabulated against the cited baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. The two major comments are addressed point-by-point below with honest responses on what can be revised.


Referee: [§4.2] (Cityscapes ablation table): the incremental mIoU gains attributed to the edge-aware loss are reported without standard deviations across multiple random seeds or statistical tests; this weakens the claim that the loss produces reliably more robust features, as the observed deltas could fall within run-to-run variance.

Authors: We agree that the absence of standard deviations or statistical tests in the §4.2 ablation table limits the strength of claims about reliable improvements from the edge-aware loss. The reported results were obtained from single training runs, which was standard practice at the time given the high computational cost of Cityscapes experiments. In the revised manuscript we will add an explicit note acknowledging this limitation and the possibility that small deltas may lie within run-to-run variance; we will also report standard deviations for the key ablations if additional compute can be secured.
Revision: partial

Referee: [§3.3] (edge-aware loss): the formulation is stated to refine boundaries 'directly from the semantic segmentation prediction,' yet the loss expression incorporates ground-truth edge maps; this mismatch between the textual claim and the actual supervision signal is load-bearing for the interpretation of how discriminative features are learned.

Authors: We thank the referee for identifying this inconsistency in §3.3. The edge-aware loss does use ground-truth edge maps (extracted from the semantic labels) together with the model's semantic segmentation prediction to supervise boundary refinement. The original wording was imprecise and overstated the degree to which refinement occurs solely from the prediction. We will revise the description in §3.3 to accurately state that the loss combines the prediction with GT edge maps, thereby clarifying how the supervision signal contributes to more discriminative features.
Revision: yes
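
To illustrate what "ground-truth edge maps extracted from the semantic labels" can mean in practice, here is a minimal sketch in plain Python. It marks a pixel as an edge when any pixel within a window of radius `k` carries a different label, then compares the edge maps of a predicted and a ground-truth label map. The function names, the use of `k` as a radius, and the disagreement score are assumptions made for illustration; the paper's actual ECE loss formula is not reproduced on this page and may differ substantially.

```python
def edge_map(labels, k=1):
    """Mark a pixel as edge (1) if any pixel within Chebyshev distance k has a different label."""
    h, w = len(labels), len(labels[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                        edges[y][x] = 1
    return edges

def edge_disagreement(pred_labels, gt_labels, k=1):
    """Fraction of pixels whose edge status differs between prediction and ground truth."""
    pred_edges = edge_map(pred_labels, k)
    gt_edges = edge_map(gt_labels, k)
    total = len(gt_labels) * len(gt_labels[0])
    diff = sum(p != g for prow, grow in zip(pred_edges, gt_edges) for p, g in zip(prow, grow))
    return diff / total

# Toy 3x4 label maps (hypothetical): the prediction misplaces one boundary pixel.
gt = [[0, 0, 1, 1],
      [0, 0, 1, 1],
      [0, 0, 1, 1]]
pred = [[0, 0, 0, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]

print(edge_map(gt))
print(edge_disagreement(pred, gt))
```

A larger `k` thickens the edge band, so small boundary shifts are penalized less sharply; in a trainable loss the hard label comparison would have to be replaced by a differentiable operation on class probabilities.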

Circularity Check

0 steps flagged

No significant circularity


The paper proposes an encoder-decoder architecture, LKPP block, and edge-aware loss, then reports empirical results on public benchmarks (Cityscapes, CamVid, NYUDv2) with SOTA comparisons under matched conditions. No equations, derivations, or self-citations are shown that reduce any claimed result to its inputs by construction; performance claims rest on external dataset evaluations rather than internal fitting or renaming.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard deep-learning assumptions about convolutional feature learning plus empirical tuning of new architectural elements on benchmark data.

free parameters (1)

kernel sizes and pyramid levels in LKPP
Design choices for expanding receptive field, selected to achieve multi-scale fusion and likely optimized on validation splits.

axioms (1)

domain assumption: Convolutional encoder-decoder networks augmented with multi-scale pooling and edge supervision can learn more discriminative features for semantic segmentation.
Core premise invoked to justify the three components.



Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

[1]

ELKPPNet: An Edge-aware Neural Network with Large Kernel Pyramid Pooling for Learning Discriminative Features in Semantic Segmentation Xianwei Zheng1,*, Linxi Huan1, Hanjiang Xiong1, Jianya Gong1,2 1The State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China 2School of Remote Sensing and In...

work page 2019
[2]

2017; Zhang et al

, urban 3D semantic modeling (Rouhani et al. 2017; Zhang et al

work page 2017
[3]

2018; Politz and Sester

and remote sensing image classification (Kemker et al. 2018; Politz and Sester

work page 2018
[4]

Since then, deep convolutional neural networks (DCNNs) have enabled semantic segmentation algorithms to achieve remarkable progress in RGB scenes

Semantic segmentation developed slowly because the various objects found in an image limit the efficiency of automatic scene parsing, until the popularization of deep learning. Since then, deep convolutional neural networks (DCNNs) have enabled semantic segmentation algorithms to achieve remarkable progress in RGB scenes. As DCNNs have the ability to lear...

work page 2015
[5]

methods based on image pyramids (Zhao et al. 2018)

work page 2018
[6]

2017); and

methods applying an encoder-decoder structure (Badrinarayanan et al. 2017); and

work page 2017
[7]

2017; Chen et al

methods deploying spatial pyramid pooling (SPP) (Zhao et al. 2017; Chen et al

work page 2017
[8]

Existing methods for multi-scale context extraction (Chen et al. 2017). The other difficulty for precise semantic segmentation lies in detail refinement. Most deep learning methods are not sensitive to detail information, and thus they often cannot maintain semantic consistency inside a single object (intra-class inconsistency) or distinguish two semantic...

work page 2017
[9]

gridding

and Zhou et al. (2018). This approach also refines the semantic boundary for prediction using geometrical information from the low-level features. The SPP module, i.e., LKPP, is constructed with large kernels with hybrid asymmetric dilated convolutions to overcome the limitations of the existing SPP modules. The LKPP module can encode rich spatial infor...

work page 2018
[10]

Many studies have focused on enhancing the robustness to scale variance by view field enlargement and effective multi-level feature fusion

2 Related work 2.1 Multi-scale Object Detection Scale variance of objects occurs frequently in natural and remote sensing images, and influences the learning ability of deep networks for semantic segmentation. Many studies have focused on enhancing the robustness to scale variance by view field enlargement and effective multi-level feature fusion. The...

work page 2016
[11]

to model region similarities (Zheng 2015; Li 2016; Chen 2016), and some adopted several sequential convolutional layers to extract long-range information (Yu and Koltun 2016; Liu et al. 2015). DenseASPP involves organizing atrous convolutional layers with increasing rates in a dense fashion to enlarge receptive field size (Yang et al

work page 2015
[12]

Structures such as DenseASPP, in particular, suffer from high computational cost and optimization trouble coming from the dense connection and concatenation

However, in practice, the extra subnetwork brings heavy computational complexity and a high memory footprint. Structures such as DenseASPP, in particular, suffer from high computational cost and optimization trouble coming from the dense connection and concatenation. The encoder-decoder framework achieves multi-level feature aggregation by merging low-lev...

work page 2015
[13]

(2016) and Li et al

Jegou et al. (2016) and Li et al. (2019) constructed dense multi-scale connections for feature aggregation, and Yu et al. (2018) hierarchically fused multi-level features by deep layer aggregation. However, these methods often need well-designed aggregation structures, which require prior knowledge and introduce a large number of parameters, which come...

work page 2016
[14]

gridding

the “gridding” problem, which happens when the view field is enlarged by dilated convolutional layers (Wang et al. 2018). In the proposed network, the balanced encoder-decoder framework is capable of efficient and computation-saving multi-level feature aggregation, and the novel spatial pyramid pooling module, LKPP, can obtain highly rich contextual fea...

work page 2018
[15]

intra-class inconsistency

, while Yu et al. (2018) combined semantic segmentation and boundary detection by two subnetworks (Smooth Network and Border Network) to address the “intra-class inconsistency” issue and enlarge the “inter-class distinction”. Jiang et al. (2017), Lee et al. (2017) and Marmanis et al. (2018) extracted edge features from DEM data or a depth map. However, te...

work page 2018
[16]

3.1 The Workflow of ELKPPNet The proposed ELKPPNet framework, as illustrated in Fig

The whole network architecture of the proposed ELKPPNet. 3.1 The Workflow of ELKPPNet The proposed ELKPPNet framework, as illustrated in Fig. 2, features a residual network as an encoder, and a decoder followed by a classifier layer and an edge extractor. ELKPPNet takes an RGB image as input, and outputs a semantic segmentation prediction at the classifie...

work page 2015
[17]

It can be seen that the gridding effect produces checkerboard-like patterns, while the proposed LKPP effectively eliminates such an effect

with the proposed LKPP module. It can be seen that the gridding effect produces checkerboard-like patterns, while the proposed LKPP effectively eliminates such an effect. As demonstrated in Chen et al. (2017), the larger the dilation rate grows, the smaller the number of effective kernel weights will become. For example, if the filter size is close to the feature ...

work page 2017
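
The shrinking number of effective kernel weights at large dilation rates can be checked with simple arithmetic. The sketch below, in plain Python, counts how many taps of a 3×3 dilated filter land inside a square feature map, assuming zero padding outside the map. The 33×33 map size, the list of rates, and the function names are hypothetical choices for illustration.

```python
def fully_valid_fraction(n, rate, k=3):
    """Fraction of positions on an n x n map where all k x k dilated taps land inside the map."""
    reach = rate * (k // 2)               # distance from the centre tap to the outermost tap
    valid_1d = max(0, n - 2 * reach)      # positions with every tap in bounds along one axis
    return (valid_1d / n) ** 2

def valid_taps_at_centre(n, rate, k=3):
    """Number of k x k taps inside the map when the filter sits on the centre pixel."""
    c = n // 2
    offsets = [rate * (i - k // 2) for i in range(k)]
    in_bounds_1d = sum(0 <= c + o < n for o in offsets)
    return in_bounds_1d ** 2

# Hypothetical 33 x 33 feature map; the rates are chosen only to show the trend.
for rate in (1, 6, 12, 18, 24):
    print(rate, round(fully_valid_fraction(33, rate), 3), valid_taps_at_centre(33, rate))
```

Once the rate exceeds half the map size, even the centre pixel sees only its own tap, so the 3×3 filter behaves like a 1×1 filter there.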
[18]

gridding

, but such a solution also causes the problem known as “gridding” (Wang et al. 2018). Taking k = 3 and r = 2 for illustration, if a group of sequential convolutional layers have the same rate r, then given an arbitrary pixel p of the top layer l_i, its receptive field is formed in a checkerboard fashion, meaning much of the information from the input is di...

work page 2018
[19]

gridding

is a solution to address the ‘gridding’ issue. [Figure panels (a) to (d): Layers 1 to 3 with rates 2, 2, 2, and Layers 1 to 3 with rates 1, 2, 3.] Given N convolutional layers {l_1, ..., l_N} with kernels of size k × k chained in cascade, and {r_1, ..., r_N} denoting their dilation rates, we can define the maximum distance between nonze...

work page 2016
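
The gridding effect described in these excerpts can be reproduced in one dimension. The sketch below enumerates which input offsets can reach a single top-layer pixel through a stack of dilated convolutions with 3-tap kernels, and reports what fraction of the receptive-field span is actually used. The two rate schedules match those named in the figure labels (2, 2, 2 and 1, 2, 3); the function names and the coverage measure are illustrative assumptions.

```python
def receptive_offsets(rates, k=3):
    """1-D input offsets that can reach one top-layer pixel through stacked dilated convs."""
    half = k // 2
    offsets = {0}
    for r in rates:
        taps = [r * t for t in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets

def coverage(rates, k=3):
    """Fraction of positions inside the receptive-field span that contribute to the output."""
    offs = receptive_offsets(rates, k)
    span = max(offs) - min(offs) + 1
    return len(offs) / span

print(sorted(receptive_offsets([2, 2, 2])))   # only even offsets are reached
print(coverage([2, 2, 2]))
print(coverage([1, 2, 3]))
```

Both schedules span the same 13 positions, but the constant-rate stack touches only every other one, which in 2-D produces the checkerboard pattern; the mixed schedule fills every position.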
[20]

gridding

The two-layer convolution in an HADC block. (a) Large Kernel Pyramid Pooling. (b) The HADC block in parallel LKPP. (c) The HADC block in cascade LKPP. Cascade LKPP: In cascade LKPP, each HADC branch consists of three two-layer pairs, and the layers in each pair are joined sequentially, which can greatly expand receptive field size, and therefore is appli...

work page 2015
[21]

intra-class inconsistency

Edge Extractor. (a) Edge map under different k. Left: Edge map with k = 1; Right: Edge map with k = 3. (b) The mechanism of edge extractor. Edge detection is a binary classification problem, but the gradient map only contains semantic edge information and optimizing semantic edge, may introduce unnecessary computation and require more GPU memory, as in Li...

work page 2018
[22]

2015; Cordts et al

network was selected as the backbone for all the models, and the experiments were conducted on three challenging semantic segmentation datasets: the Cityscapes (Cordts et al. 2015; Cordts et al. 2016), and CamVid (Fauqueur et al. 2007; Badrinarayanan et al

work page 2015
[23]

2012; Lee et al

outdoor datasets and the NYUDv2 indoor scene parsing benchmark dataset (Silberman et al. 2012; Lee et al. 2017). Ablation studies were first conducted on the Cityscapes dataset to validate the proposed balanced encoder-decoder structure, the LKPP module, and the ECE loss function, respectively. To allow a comprehensive evaluation, the whole ELKPPNet was f...

work page 2012
[24]

and PSPNet (Zhao et al. 2017). In all the experiments, except for mirror flip, no extra training tricks were used, especially those related to detail augmentation and multi-scale detection, because other training tricks add more random information, making it difficult to determine whether the discriminative feature learning ability is boosted by the given...

work page 2017
[25]

Prediction results of U-Net and the balanced encoder-decoder. In (c), the U-Net structure yields droplet-like over-smoothed patches, which even erase the corners of the traffic sign (yellow) and distort its square shape into a nearly round one. In (d), the proposed balanced encoder-decoder framework more precisely draws out the contours of trees, pedestri...

work page 2018
[26]

gridding

In DilatedNet, the final upsampling operation helps to remove the visible “gridding” problem in the final output. The dilation rates of ASPP and DenseASPP were the same as Chen et al. (2018) (i.e., ASPP module with dilation rate of 1, 6, 12,

work page 2018
[27]

gridding

and Yang et al. (2018) (i.e., DenseASPP with dilation rate of 3, 6, 12, 18, 24). The kernels used in the LKPP module were set to 3×3, 3×5 (5×3), and 3×7 (7×3), and the rates in every HADC of the LKPP module were set as 1, 2, 3, to avoid superfluous invalid information caused by zero values introduced by large dilations. The baseline was a ResNet-50 networ...

work page 2018
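
For the kernel sizes and rates quoted above, the footprint of each dilated kernel follows from the standard relation k_eff = k + (k - 1)(r - 1). The sketch below tabulates it in plain Python. Applying the same rate along both axes, and the names `effective_size` and `extents`, are assumptions for illustration; how the HADC block assigns rates to its individual layers is not stated in this excerpt.

```python
def effective_size(k, rate):
    """Extent covered by a k-tap kernel with dilation `rate`: k + (k - 1) * (rate - 1)."""
    return k + (k - 1) * (rate - 1)

# Kernel shapes and rates quoted in the text; 5x3 and 7x3 are the transposed variants.
KERNELS = [(3, 3), (3, 5), (3, 7)]
RATES = (1, 2, 3)

def extents(kernels=KERNELS, rates=RATES):
    """Map each (height, width) kernel to its effective (height, width) at every rate."""
    return {
        (kh, kw): [(effective_size(kh, r), effective_size(kw, r)) for r in rates]
        for kh, kw in kernels
    }

for shape, ext in extents().items():
    print(shape, ext)
```

Even at the largest quoted rate of 3, a 3×7 kernel spans 7×19 positions, far less than a 3×3 kernel at rate 24 (49×49), which is consistent with the stated aim of avoiding taps that fall on zero padding.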
[28]

Quantitative analysis on 37-class NYUDv2 dataset (unit: %).

| Model | mIoU | FWIoU | PixelAcc | MeanClassAcc |
|---|---|---|---|---|
| Deeplabv3 | 28.51 | 48.32 | 64.27 | 34.48 |
| Deeplabv3+ | 29.30 | 50.09 | 65.69 | 35.03 |
| DenseASPP | 30.77 | 50.53 | 67.13 | 35.36 |
| PSPNet | 24.11 | 45.75 | 61.18 | 29.93 |
| RefineNet | 29.40 | 50.79 | 66.92 | 34.43 |
| Our ELKPPNet (parallel) | 34.41 | 55.11 | 70.03 | 39.00 |

[Figure panels: Test Image, Ground Truth, DeepLabV3, Dee...]

work page 2015
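
The four metrics reported for NYUDv2 (mIoU, FWIoU, PixelAcc, MeanClassAcc) are all derived from a class confusion matrix. The sketch below computes them in plain Python using their usual definitions. The 2-class confusion matrix is a made-up toy example, and skipping classes that are absent from the ground truth is one common convention, assumed here rather than taken from the paper.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels with ground-truth class i predicted as class j."""
    n = len(conf)
    total = sum(map(sum, conf))
    gt = [sum(conf[i]) for i in range(n)]                        # pixels per true class
    pred = [sum(conf[i][j] for i in range(n)) for j in range(n)]  # pixels per predicted class
    tp = [conf[i][i] for i in range(n)]
    present = [i for i in range(n) if gt[i] > 0]                 # skip classes absent from GT
    iou = {i: tp[i] / (gt[i] + pred[i] - tp[i]) for i in present}
    return {
        "PixelAcc": sum(tp) / total,
        "MeanClassAcc": sum(tp[i] / gt[i] for i in present) / len(present),
        "mIoU": sum(iou.values()) / len(present),
        "FWIoU": sum(gt[i] / total * iou[i] for i in present),
    }

# Toy 2-class confusion matrix (hypothetical pixel counts).
conf = [[8, 2],
        [1, 9]]
for name, value in segmentation_metrics(conf).items():
    print(f"{name}: {100 * value:.2f}%")
```

FWIoU weights each class IoU by its pixel frequency, so it sits closer to PixelAcc on datasets dominated by a few large classes, while mIoU treats every class equally.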
[29]

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. https://arxiv.org/abs/1603.04467 Badrinarayanan V., Kendall A., Cipolla R.,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Rethinking Atrous Convolution for Semantic Image Segmentation

Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. https://arxiv.org/abs/1706.05587 Cordts M., Omran M., Ramos S., Rehfeld T., Enzweiler M., Benenson R., Franke U., Roth S. and Schiele B.,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

IEEE International Conference on Computer Vision

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. IEEE International Conference on Computer Vision. Farabet C., Couprie C., Najman L., Lecun Y., 2013, Learning Hierarchical Features for Scene Labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915-1929. Fauqueur J., B...

work page 2013
[32]

2007 IEEE International Conference on Computer Vision, 1-7, IEEE

Assisted video object labeling by joint tracking of regions and keypoints. 2007 IEEE International Conference on Computer Vision, 1-7, IEEE. Gonzalez R. and Woods R.,

work page 2007
[33]

IEEE Transactions on Pattern Analysis & Machine Intelligence 37(9):1904-

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence 37(9):1904-

work page 1904
[34]

2017 IEEE International Conference on Software Engineering and Service Science

Incorporating depth into both CNN and CRF for indoor semantic segmentation. 2017 IEEE International Conference on Software Engineering and Service Science. Kemker R., Salvaggio C. and Kanan C.,

work page 2017
[35]

2017 IEEE International Conference on Computer Vision

RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation. 2017 IEEE International Conference on Computer Vision. Li H., Xiong P., Fan H. and Sun J.,

work page 2017
[36]

DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation

DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation. arXiv preprint arXiv:1904.02216. https://arxiv.org/abs/1904.02216 Li W. and Yang M.,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[37]

2017 IEEE Conference on Computer Vision and Pattern Recognition

RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition. Lin T., Goyal P., Girshick R., He K. and Piotr D.,

work page 2017
[38]

arXiv preprint arXiv:1804.02864

Semantic edge detection with diverse deep supervision. arXiv preprint arXiv:1804.02864. https://arxiv.org/abs/1804.02864 Liu Y., Cheng M., Hu X., Wang K. and Bai X.,

work page arXiv
[39]

International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences-ISPRS Archives 42 (2018), Nr

Exploring ALS and DIM data for semantic segmentation using CNNs. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences-ISPRS Archives 42 (2018), Nr. 1 42(1): 347-354. Ronneberger O., Fischer P. and Brox T.,

work page 2018
[40]

2018 IEEE Winter Conference on Applications of Computer Vision (pp

Understanding convolution for semantic segmentation. 2018 IEEE Winter Conference on Applications of Computer Vision (pp. 1451-1460) Xiao J., Owens A. and Torralba A.,

work page 2018

[1] [1]

ELKPPNet: An Edge-aware Neural Network with Large Kernel Pyramid Pooling for Learning Discriminative Features in Semantic Segmentation Xianwei Zheng1,*, Linxi Huan1, Hanjiang Xiong1, Jianya Gong1,2 1The State Key Laboratory of Information Engineering in Su rveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China 2School of Remote Sensing and In...

work page 2019

[2] [2]

2017; Zhang et al

, urban 3D semantic modeling (Rouhani et al. 2017; Zhang et al

work page 2017

[3] [3]

2018; Politz and Sester

and remote sensing image classification (Kemker et al. 2018; Politz and Sester

work page 2018

[4] [4]

Since then, deep convolutional neural networks (DCNNs) have enabled semantic segmentation algorithms to achieve remarkable progress in RGB scenes

Semantic segmentation developed slowly because the various objects found in an image limit the efficiency of automatic scene parsing, until the popularization of deep learning. Since then, deep convolutional neural networks (DCNNs) have enabled semantic segmentation algorithms to achieve remarkable progress in RGB scenes. As DCNNs have the ability to lear...

work page 2015

[5] [5]

methods based on image pyramids (Zhao et al. 2018)

work page 2018

[6] [6]

2017); and

methods applying an encoder -decoder structure (Badrinarayanan et al. 2017); and

work page 2017

[7] [7]

2017; Chen et al

methods deploying spatial pyramid pooling (SPP) (Zhao et al. 2017; Chen et al

work page 2017

[8] [8]

Existing methods for multi-scale context extraction (Chen et al. 2017). The other difficulty for precise semantic segmentation lies in detail refinement. Most deep learning methods are not sensitive to detail information, and thus they often cannot maintain semantic consistency inside a single object (intra-class inconsistency) or distinguish two semantic...

work page 2017

[9] [9]

gridding

and Zhou et al. (2018) . This approach also refines the semantic boundary for prediction using geometrical information from the low -level features. The SPP module, i.e., LKPP, is constructed with large kernels with hybrid asymmetric dilated convolutions to overcome the limitations of the existing SPP modules. The LKPP module can encode rich spatial infor...

work page 2018

[10] [10]

Many researches have focused on enhancing the robustness to scale variance by view field enlargement and effective multi -level feature fusion

2 Related work 2.1 Multi-scale Object Detection Scale variance of objects occurs frequently in natural and remote sensing images, and influences the learning ability of deep networks for semantic segmentation. Many researches have focused on enhancing the robustness to scale variance by view field enlargement and effective multi -level feature fusion. The...

work page 2016

[11] [11]

to model region similarities (Zheng 2015 ; Li 2016; Chen 2016), and some adopted several sequential convolutional layers to extract long-range information (Yu and Koltun 2016; Liu et al. 2015). DenseASPP involves organizing atrous convolutional layers with increasing rates in a dense fashion to enlarge receptive filed size (Yang et al

work page 2015

[12] [12]

Structures such as DenseASPP, in particular, suffer from high computational cost and optimization trouble coming from the dense connection and concatenation

However, in practice, the extra subnetwork brings heavy computational complexity and a high memory footprint. Structures such as DenseASPP, in particular, suffer from high computational cost and optimization trouble coming from the dense connection and concatenation. The encoder-decoder framework achieves multi-level feature aggregation by merging low-lev...

work page 2015

[13] [13]

(2016) and Li et al

Jegou et al. (2016) and Li et al. (2019) constructed dense multi -scale connections for fe ature aggregation, and Yu et al. (2018) hierarchically fused multi-level features by deep layer aggregation. However, these methods often need well -designed aggregation structures, which require prior knowledge and introduce a large number of parameters, which come...

work page 2016

[14] [14]

gridding

the “gridding” problem, which happens when the view field is enlarged by dilated convolutional layers (Wang et al. 2018). In the proposed network, the balanced encoder-decoder framework is capable of efficient and computation-saving multi-level feature aggregation, and the novel spatial pyramid pooling module, LKPP, can obtain highly rich contextual fea...

work page 2018

[15] [15]

intra-class inconsistency

, while Yu et al. (2018) combined semantic segmentation and boundary detection by two subnetworks (Smooth Network and Border Network) to address the “intra-class inconsistency” issue and enlarge the “inter-class distinction”. Jiang et al. (2017), Lee et al. (2017) and Marmanis et al. (2018) extracted edge features from DEM data or a depth map. However, te...

work page 2018

[16] [16]

3.1 The Workflow of ELKPPNet The proposed ELKPPNet framework, as illustrated in Fig

The whole network architecture of the proposed ELKPPNet.

3.1 The Workflow of ELKPPNet

The proposed ELKPPNet framework, as illustrated in Fig. 2, features a residual network as an encoder, and a decoder followed by a classifier layer and an edge extractor. ELKPPNet takes an RGB image as input, and outputs a semantic segmentation prediction at the classifie...

work page 2015

[17] [17]

It can be seen that the gridding effect produces checkerboard-like patterns, while the proposed LKPP effectively eliminates such an effect

with the proposed LKPP module. It can be seen that the gridding effect produces checkerboard-like patterns, while the proposed LKPP effectively eliminates such an effect. As demonstrated in Chen et al. (2017), the larger the dilation rate grows, the smaller the number of effective kernel weights becomes. For example, if the filter size is close to the feature ...
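The shrinking number of effective kernel weights can be illustrated with a small counting exercise. The sketch below is not taken from the paper: it assumes a square feature map with zero padding, a 3×3 filter, and a hypothetical map size of 65, and it simply counts how many kernel taps land on real feature values rather than on padding.

```python
def valid_weight_fraction(feature_size, rate, kernel=3):
    """Average fraction of taps of a dilated kernel x kernel filter that land
    inside a feature_size x feature_size map (i.e., not on zero padding)."""
    half = kernel // 2
    offsets = [(i - half) * rate for i in range(kernel)]
    # Count (position, tap) pairs along one axis that stay inside the map.
    inside_1d = sum(
        1
        for pos in range(feature_size)
        for off in offsets
        if 0 <= pos + off < feature_size
    )
    frac_1d = inside_1d / (feature_size * kernel)
    # Row and column validity are independent, so the 2-D fraction is the square.
    return frac_1d ** 2


# Hypothetical setting: 65 x 65 feature map, increasing dilation rates.
for rate in (1, 6, 12, 18, 24, 65):
    print(rate, round(valid_weight_fraction(65, rate), 3))
```

Once the rate reaches the feature map size, only the center tap ever sees real data, so the 3×3 filter behaves like a 1×1 filter.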

work page 2017

[18] [18]

gridding

, but such a solution also causes the problem known as “gridding” (Wang et al. 2018). Taking $k = 3$ and $r = 2$ for illustration, if a group of sequential convolutional layers have the same rate $r$, then given an arbitrary pixel $p$ of the top layer $l_i$, its receptive field is formed in a checkerboard fashion, meaning much of the information from the input is di...

work page 2018

[19] [19]

gridding

is a solution to address the ‘gridding’ issue.

[Figure panels (a)-(d). Panel labels: Layer1, rate=2; Layer2, rate=2; Layer3, rate=2. Layer1, rate=1; Layer2, rate=2; Layer3, rate=3. Rate=1, Rate=2, Rate=3.]

Given $N$ convolutional layers $\{l_1, \dots, l_N\}$ with kernels of size $k \times k$ chained in cascade, and $\{r_1, \dots, r_N\}$ denoting their dilation rates, we can define the maximum distance between nonze...
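The difference between the two rate configurations in the figure labels (2, 2, 2 versus 1, 2, 3) can be checked by enumerating which input positions a single top-layer pixel can reach. The sketch below is a one-dimensional illustration under the assumption of three stacked layers with kernel size 3; the function names are hypothetical and it is not the paper's own formulation of the maximum distance between nonzero values.

```python
from itertools import product


def receptive_offsets(rates, kernel=3):
    """Input offsets (1-D) that contribute to one output pixel after
    stacking dilated convolutions with the given rates."""
    half = kernel // 2
    taps_per_layer = [[(i - half) * r for i in range(kernel)] for r in rates]
    # Every path through the stack picks one tap per layer; offsets add up.
    return {sum(path) for path in product(*taps_per_layer)}


def coverage(rates, kernel=3):
    """Fraction of positions inside the receptive field that are actually used."""
    reach = sum(r * (kernel // 2) for r in rates)
    return len(receptive_offsets(rates, kernel)) / (2 * reach + 1)


print(sorted(receptive_offsets([2, 2, 2])))  # only even offsets are reached
print(coverage([2, 2, 2]))                   # 7 of 13 positions
print(coverage([1, 2, 3]))                   # all 13 positions
```

With equal rates, every odd offset is skipped, which is the checkerboard pattern in two dimensions; the mixed rates reach the same extent without holes.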

work page 2016

[20] [20]

gridding

The two-layer convolution in an HADC block. (a) Large Kernel Pyramid Pooling. (b) The HADC block in parallel LKPP. (c) The HADC block in cascade LKPP.

Cascade LKPP: In cascade LKPP, each HADC branch consists of three two-layer pairs, and the layers in each pair are joined sequentially, which can greatly expand the receptive field size, and therefore is appli...

work page 2015

[21] [21]

intra-class inconsistency

Edge Extractor. (a) Edge map under different $k$. Left: edge map with $k = 1$; right: edge map with $k = 3$. (b) The mechanism of the edge extractor.

Edge detection is a binary classification problem, but the gradient map only contains semantic edge information, and optimizing semantic edges may introduce unnecessary computation and require more GPU memory, as in Li...

work page 2018

[22] [22]

2015; Cordts et al

network was selected as the backbone for all the models, and the experiments were conducted on three challenging semantic segmentation datasets: the Cityscapes (Cordts et al. 2015; Cordts et al. 2016) and CamVid (Fauqueur et al. 2007; Badrinarayanan et al

work page 2015

[23] [23]

2012; Lee et al

outdoor datasets and the NYUDv2 indoor scene parsing benchmark dataset (Silberman et al. 2012; Lee et al. 2017). Ablation studies were first conducted on the Cityscapes dataset to validate the proposed balanced encoder-decoder structure, the LKPP module, and the ECE loss function, respectively. To allow a comprehensive evaluation, the whole ELKPPNet was f...

work page 2012

[24] [24]

and PSPNet (Zhao et al. 2017). In all the experiments, except for mirror flip, no extra training tricks were used, especially those related to detail augmentation and multi-scale detection, because other training tricks add more random information, making it difficult to determine whether the discriminative feature learning ability is boosted by the given...

work page 2017

[25] [25]

Prediction results of U-Net and the balanced encoder-decoder. In (c), the U-Net structure yields droplet-like over-smoothed patches, which even erase the corners of the traffic sign (yellow) and distort its square shape into a nearly round one. In (d), the proposed balanced encoder-decoder framework more precisely draws out the contours of trees, pedestri...

work page 2018

[26] [26]

gridding

In DilatedNet, the final upsampling operation helps to remove the visible “gridding” problem in the final output. The dilation rates of ASPP and DenseASPP were the same as in Chen et al. (2018) (i.e., ASPP module with dilation rates of 1, 6, 12,

work page 2018

[27] [27]

gridding

and Yang et al. (2018) (i.e., DenseASPP with dilation rates of 3, 6, 12, 18, 24). The kernels used in the LKPP module were set to 3×3, 3×5 (5×3), and 3×7 (7×3), and the rates in every HADC of the LKPP module were set to 1, 2, 3 to avoid superfluous invalid information caused by zero values introduced by large dilations. The baseline was a ResNet-50 networ...
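To make the trade-off between kernel size and dilation concrete, the following sketch computes the span covered by each of the stated kernel shapes at rates 1, 2, and 3, and the share of that span occupied by actual weights rather than inserted zeros. It relies on the standard relation that $k$ taps spaced $r$ apart span $r(k-1)+1$ pixels; applying the same rate to both kernel axes is an assumption made here for illustration, not something the text confirms.

```python
def dilated_extent(kernel_size, rate):
    """Span, in pixels, covered by kernel_size taps spaced rate apart."""
    return rate * (kernel_size - 1) + 1


def nonzero_fraction(kernel_size, rate):
    """Share of the dilated span (along one axis) holding real weights."""
    return kernel_size / dilated_extent(kernel_size, rate)


kernel_shapes = [(3, 3), (3, 5), (3, 7)]  # kernel shapes listed in the text
rates = [1, 2, 3]                         # HADC rates listed in the text

# Assumption: the same rate is applied along both kernel axes.
extents = {
    (shape, rate): (dilated_extent(shape[0], rate), dilated_extent(shape[1], rate))
    for shape in kernel_shapes
    for rate in rates
}

for (shape, rate), extent in extents.items():
    print(shape, "rate", rate, "-> extent", extent)

# For comparison: a 3-tap axis at rate 24 spans 49 pixels with only 3 weights.
print(dilated_extent(3, 24), round(nonzero_fraction(3, 24), 3))
```

Under this reading, a 7-tap axis at rate 3 spans 19 pixels with 7 of them carrying weights, whereas a 3-tap axis needs rate 9 to reach the same span and then carries only 3.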

work page 2018

[28] [28]

Quantitative analysis on the 37-class NYUDv2 dataset (unit: %).

Model                     mIoU    FWIoU   PixelAcc   MeanClassAcc
Deeplabv3                 28.51   48.32   64.27      34.48
Deeplabv3+                29.30   50.09   65.69      35.03
DenseASPP                 30.77   50.53   67.13      35.36
PSPNet                    24.11   45.75   61.18      29.93
RefineNet                 29.40   50.79   66.92      34.43
Our ELKPPNet (parallel)   34.41   55.11   70.03      39.00

Test Image Ground Truth DeepLabV3 Dee...
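The four metrics reported in the NYUDv2 table are commonly derived from a class confusion matrix. The sketch below uses the usual definitions and assumes that classes absent from the ground truth are skipped when averaging; whether the paper's evaluation handles absent or ignored classes in exactly this way is not stated, so the function should be read as an illustration only.

```python
def segmentation_metrics(conf):
    """Compute mIoU, FWIoU, PixelAcc and MeanClassAcc from a square confusion
    matrix where conf[i][j] counts pixels of true class i predicted as j."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    tp = [conf[i][i] for i in range(n)]
    gt = [sum(conf[i]) for i in range(n)]                         # true pixels per class
    pred = [sum(conf[i][j] for i in range(n)) for j in range(n)]  # predicted pixels per class
    present = [i for i in range(n) if gt[i] > 0]                  # assumption: skip absent classes

    iou = {i: tp[i] / (gt[i] + pred[i] - tp[i]) for i in present}
    acc = {i: tp[i] / gt[i] for i in present}
    return {
        "mIoU": sum(iou.values()) / len(present),
        "FWIoU": sum(gt[i] / total * iou[i] for i in present),
        "PixelAcc": sum(tp) / total,
        "MeanClassAcc": sum(acc.values()) / len(present),
    }


# Toy two-class example (counts are made up for illustration).
toy = [[3, 1],
       [1, 5]]
print(segmentation_metrics(toy))
```

FWIoU weights each class IoU by its ground-truth pixel share, which is why it sits between mIoU and PixelAcc in the table when large classes are segmented better than small ones.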

work page 2015

[29] [29]

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. https://arxiv.org/abs/1603.04467 Badrinarayanan V., Kendall A., Cipolla R.,

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

Rethinking Atrous Convolution for Semantic Image Segmentation

Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. https://arxiv.org/abs/1706.05587 Cordts M., Omran M., Ramos S., Rehfeld T., Enzweiler M., Benenson R., Franke U., Roth S. and Schiele B.,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

IEEE International Conference on Computer Vision

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. IEEE International Conference on Computer Vision. Farabet C., Couprie C., Najman L., LeCun Y., 2013, Learning Hierarchical Features for Scene Labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915-1929. Fauqueur J., B...

work page 2013

[32] [32]

2007 IEEE International Conference on Computer Vision, 1-7, IEEE

Assisted video object labeling by joint tracking of regions and keypoints. 2007 IEEE International Conference on Computer Vision, 1-7, IEEE. Gonzalez R. and Woods R.,

work page 2007

[33] [33]

IEEE Transactions on Pattern Analysis & Machine Intelligence 37(9):1904-

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence 37(9):1904-

work page 1904

[34] [34]

2017 IEEE International Conference on Software Engineering and Service Science

Incorporating depth into both CNN and CRF for indoor semantic segmentation. 2017 IEEE International Conference on Software Engineering and Service Science. Kemker R., Salvaggio C. and Kanan C.,

work page 2017

[35] [35]

2017 IEEE International Conference on Computer Vision

RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation. 2017 IEEE International Conference on Computer Vision. Li H., Xiong P., Fan H. and Sun J.,

work page 2017

[36] [36]

DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation

DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation. arXiv preprint arXiv:1904.02216. https://arxiv.org/abs/1904.02216 Li W. and Yang M.,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[37] [37]

2017 IEEE Conference on Computer Vision and Pattern Recognition

RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition. Lin T., Goyal P., Girshick R., He K. and Piotr D.,

work page 2017

[38] [38]

arXiv preprint arXiv:1804.02864

Semantic edge detection with diverse deep supervision. arXiv preprint arXiv:1804.02864. https://arxiv.org/abs/1804.02864 Liu Y., Cheng M., Hu X., Wang K. and Bai X.,

work page arXiv

[39] [39]

International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences-ISPRS Archives 42 (2018), Nr

Exploring ALS and DIM data for semantic segmentation using CNNs. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences - ISPRS Archives, 42(1): 347-354. Ronneberger O., Fischer P. and Brox T.,

work page 2018

[40] [40]

2018 IEEE Winter Conference on Applications of Computer Vision (pp

Understanding convolution for semantic segmentation. 2018 IEEE Winter Conference on Applications of Computer Vision (pp. 1451-1460). Xiao J., Owens A. and Torralba A.,

work page 2018