ELKPPNet: An Edge-aware Neural Network with Large Kernel Pyramid Pooling for Learning Discriminative Features in Semantic Segmentation
Pith reviewed 2026-05-25 15:12 UTC · model grok-4.3
The pith
ELKPPNet achieves superior semantic segmentation on Cityscapes, CamVid, and NYUDv2 by pairing a balanced encoder-decoder with large kernel pyramid pooling and an edge-aware loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the ELKPPNet architecture, formed by a balanced encoder-decoder network, the LKPP block with densely expanding receptive field, and the new edge-aware loss applied directly to the prediction map, produces more robust and discriminative features that improve both multi-scale object detection and boundary accuracy.
What carries the argument
The large kernel spatial pyramid pooling (LKPP) block that creates a densely expanding receptive field for multi-scale feature extraction and fusion, together with the edge-aware loss that operates directly on the semantic segmentation prediction.
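To make the phrase "densely expanding receptive field" concrete, the sketch below applies the standard receptive-field formula for stride-1 convolutions chained in cascade. The kernel lengths and dilation rates are illustrative assumptions, and `receptive_field` is a hypothetical helper; this is not the exact LKPP configuration.

```python
def receptive_field(kernel_sizes, dilation_rates):
    """Receptive field along one axis of stride-1 convolutions chained in cascade."""
    if len(kernel_sizes) != len(dilation_rates):
        raise ValueError("need exactly one dilation rate per layer")
    field = 1
    for k, r in zip(kernel_sizes, dilation_rates):
        # A k-tap kernel with dilation r spans (k - 1) * r + 1 input positions,
        # so each layer widens the field by (k - 1) * r.
        field += (k - 1) * r
    return field


# Hypothetical pyramid branches: kernel lengths and rates are assumptions
# chosen for illustration, not values taken from the reviewed paper.
branches = {
    "kernel length 3": receptive_field([3, 3, 3], [1, 2, 3]),
    "kernel length 5": receptive_field([5, 5, 5], [1, 2, 3]),
    "kernel length 7": receptive_field([7, 7, 7], [1, 2, 3]),
}

for name, field in branches.items():
    print(f"{name}: receptive field {field}")
```

Under these assumed settings, longer kernels widen the field faster than larger dilation rates alone, which is one plausible reading of why a large-kernel pyramid can cover many object scales with small rates.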
If this is right
- Models can handle multi-scale objects more reliably in both urban driving scenes and indoor environments.
- Adjacent objects with similar appearance become easier to separate without extra post-processing.
- Semantic consistency inside single objects improves because boundary signals feed back into feature learning.
- The same loss can be attached to other encoder-decoder backbones to gain boundary refinement without redesigning the whole network.
Where Pith is reading between the lines
- The edge-aware loss could be tested as a plug-in module on existing state-of-the-art segmentation networks to measure isolated gains.
- Large-kernel pyramid designs might transfer to other dense-prediction tasks such as depth estimation or surface normal prediction.
- Evaluating the model on additional datasets like ADE20K would reveal whether the gains hold beyond the three reported benchmarks.
Load-bearing premise
That the edge-aware loss function refines boundaries directly from the semantic segmentation prediction to yield more robust and discriminative features.
What would settle it
If ELKPPNet fails to exceed the accuracy of the strongest competing methods on the Cityscapes validation set when trained and evaluated under identical conditions and protocols, the superiority claim would be falsified.
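Such a comparison would most likely be scored with mean intersection-over-union (mIoU), the metric the referee report refers to. The sketch below is a minimal, generic mIoU computation from a confusion matrix. The function name and toy matrix are assumptions for illustration, and official benchmark scripts may treat ignored labels differently.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] counts pixels whose ground-truth class is i and whose
    predicted class is j. Classes that appear in neither the ground truth
    nor the prediction are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[row][c] for row in range(n)) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class confusion matrix (made-up counts, not benchmark data).
toy = [
    [8, 2],
    [1, 9],
]
print(f"mIoU = {mean_iou(toy):.4f}")
```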
Original abstract
Semantic segmentation has been a hot topic across diverse research fields. Along with the success of deep convolutional neural networks, semantic segmentation has made great achievements and improvements, in terms of both urban scene parsing and indoor semantic segmentation. However, most of the state-of-the-art models are still faced with a challenge in discriminative feature learning, which limits the ability of a model to detect multi-scale objects and to guarantee semantic consistency inside one object or distinguish different adjacent objects with similar appearance. In this paper, a practical and efficient edge-aware neural network is presented for semantic segmentation. This end-to-end trainable engine consists of a new encoder-decoder network, a large kernel spatial pyramid pooling (LKPP) block, and an edge-aware loss function. The encoder-decoder network was designed as a balanced structure to narrow the semantic and resolution gaps in multi-level feature aggregation, while the LKPP block was constructed with a densely expanding receptive field for multi-scale feature extraction and fusion. Furthermore, the new powerful edge-aware loss function is proposed to refine the boundaries directly from the semantic segmentation prediction for more robust and discriminative features. The effectiveness of the proposed model was demonstrated using Cityscapes, CamVid, and NYUDv2 benchmark datasets. The performance of the two structures and the edge-aware loss function in ELKPPNet was validated on the Cityscapes dataset, while the complete ELKPPNet was evaluated on the CamVid and NYUDv2 datasets. A comparative analysis with the state-of-the-art methods under the same conditions confirmed the superiority of the proposed algorithm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ELKPPNet, an end-to-end trainable encoder-decoder network augmented with a large kernel pyramid pooling (LKPP) block and an edge-aware loss function, for semantic segmentation. It claims that the balanced encoder-decoder narrows semantic and resolution gaps, the LKPP provides densely expanding receptive fields for multi-scale fusion, and the edge-aware loss refines boundaries directly from predictions to yield more discriminative features, resulting in superior performance over state-of-the-art methods on the Cityscapes, CamVid, and NYUDv2 benchmarks under comparable conditions, with component ablations reported on Cityscapes.
Significance. If the reported gains hold, the work supplies a practical architecture combining multi-scale pooling with boundary-aware supervision that could aid urban scene parsing and indoor segmentation tasks. The provision of Cityscapes ablations plus cross-dataset evaluation on two additional benchmarks supplies external grounding for the central empirical claim. The explicit design of the LKPP block and the end-to-end formulation are concrete contributions that can be directly compared by subsequent work.
major comments (2)
- §4.2 (Cityscapes ablation table): the incremental mIoU gains attributed to the edge-aware loss are reported without standard deviations across multiple random seeds or statistical tests; this weakens the claim that the loss produces reliably more robust features, as the observed deltas could fall within run-to-run variance.
- §3.3 (edge-aware loss): the formulation is stated to refine boundaries 'directly from the semantic segmentation prediction,' yet the loss expression incorporates ground-truth edge maps; this mismatch between the textual claim and the actual supervision signal is load-bearing for the interpretation of how discriminative features are learned.
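The run-to-run variance concern in the §4.2 comment can be made concrete with a small check. The sketch below compares the mean gain of a variant against the spread across seeds. The scores are placeholder values invented for illustration, the helper names are hypothetical, and the two-standard-deviation threshold is a crude heuristic rather than a formal statistical test.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores), stdev(scores)


def delta_exceeds_noise(baseline, variant, factor=2.0):
    """Crude check: is the mean gain larger than `factor` times the larger std?"""
    base_mean, base_std = summarize(baseline)
    var_mean, var_std = summarize(variant)
    return (var_mean - base_mean) > factor * max(base_std, var_std)


# Placeholder mIoU values (percent) for three hypothetical seeds.
# These numbers are NOT from the paper; they only exercise the check.
without_edge_loss = [70.0, 70.4, 69.6]
with_edge_loss = [70.5, 70.1, 70.9]

print(summarize(without_edge_loss))
print(summarize(with_edge_loss))
print(delta_exceeds_noise(without_edge_loss, with_edge_loss))
```

With these made-up numbers the gain of about 0.5 points sits inside the seed-to-seed spread, which is the situation the referee is worried about when only single runs are reported.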
minor comments (3)
- Figure 2 (LKPP block diagram): the kernel sizes and dilation rates inside the pyramid levels are not numerically annotated on the figure itself, forcing the reader to cross-reference the text.
- Table 1 (Cityscapes results): the column headers for 'Params' and 'FPS' are present but the corresponding values for the proposed model are omitted in one row, breaking direct efficiency comparison.
- §5 (NYUDv2 evaluation): the protocol states 'same conditions' as prior work, yet the exact training schedule, crop size, and data augmentation details are only summarized rather than tabulated against the cited baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. The two major comments are addressed point-by-point below with honest responses on what can be revised.
- Referee: §4.2 (Cityscapes ablation table): the incremental mIoU gains attributed to the edge-aware loss are reported without standard deviations across multiple random seeds or statistical tests; this weakens the claim that the loss produces reliably more robust features, as the observed deltas could fall within run-to-run variance.
  Authors: We agree that the absence of standard deviations or statistical tests in the §4.2 ablation table limits the strength of claims about reliable improvements from the edge-aware loss. The reported results were obtained from single training runs, which was standard practice at the time given the high computational cost of Cityscapes experiments. In the revised manuscript we will add an explicit note acknowledging this limitation and the possibility that small deltas may lie within run-to-run variance; we will also report standard deviations for the key ablations if additional compute can be secured.
  Revision: partial
- Referee: §3.3 (edge-aware loss): the formulation is stated to refine boundaries 'directly from the semantic segmentation prediction,' yet the loss expression incorporates ground-truth edge maps; this mismatch between the textual claim and the actual supervision signal is load-bearing for the interpretation of how discriminative features are learned.
  Authors: We thank the referee for identifying this inconsistency in §3.3. The edge-aware loss does use ground-truth edge maps (extracted from the semantic labels) together with the model's semantic segmentation prediction to supervise boundary refinement. The original wording was imprecise and overstated the degree to which refinement occurs solely from the prediction. We will revise the description in §3.3 to accurately state that the loss combines the prediction with GT edge maps, thereby clarifying how the supervision signal contributes to more discriminative features.
  Revision: yes
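The supervision signal described in the §3.3 response can be illustrated with a minimal sketch: a ground-truth edge map is derived from the semantic labels, and a binary cross-entropy term compares it with predicted edge probabilities. Everything here is an assumption made for illustration, including the 4-neighbour edge rule, the helper names `edge_map` and `edge_bce`, and the use of plain cross-entropy. How the predicted edge probabilities are obtained from the segmentation output is not specified in this review and is left out.

```python
import math


def edge_map(labels):
    """Mark a pixel as edge (1) if any 4-neighbour carries a different label."""
    height, width = len(labels), len(labels[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and labels[ny][nx] != labels[y][x]:
                    edges[y][x] = 1
                    break
    return edges


def edge_bce(pred_edge_prob, gt_edges, eps=1e-7):
    """Mean binary cross-entropy between predicted edge probabilities and GT edges."""
    total, count = 0.0, 0
    for prob_row, gt_row in zip(pred_edge_prob, gt_edges):
        for p, g in zip(prob_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
            total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
            count += 1
    return total / count


# Toy ground-truth label map with one vertical class boundary (made-up data).
gt_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
gt_edges = edge_map(gt_labels)

# An uninformative prediction (probability 0.5 everywhere) for comparison.
uniform_pred = [[0.5] * 4 for _ in range(2)]
print(gt_edges)
print(edge_bce(uniform_pred, gt_edges))
```

The point of the sketch is only that the target `gt_edges` comes from the labels, not from the prediction, which is the distinction the referee asked the authors to state clearly.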
Circularity Check
No significant circularity
The paper proposes an encoder-decoder architecture, LKPP block, and edge-aware loss, then reports empirical results on public benchmarks (Cityscapes, CamVid, NYUDv2) with SOTA comparisons under matched conditions. No equations, derivations, or self-citations are shown that reduce any claimed result to its inputs by construction; performance claims rest on external dataset evaluations rather than internal fitting or renaming.
Axiom & Free-Parameter Ledger
free parameters (1)
- kernel sizes and pyramid levels in LKPP
axioms (1)
- Domain assumption: convolutional encoder-decoder networks augmented with multi-scale pooling and edge supervision can learn more discriminative features for semantic segmentation.