pith. sign in

arxiv: 1906.11428 · v1 · pith:4HMYKA5Xnew · submitted 2019-06-27 · 💻 cs.CV

ELKPPNet: An Edge-aware Neural Network with Large Kernel Pyramid Pooling for Learning Discriminative Features in Semantic Segmentation

Pith reviewed 2026-05-25 15:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic segmentationedge-aware losslarge kernel pyramid poolingencoder-decoder networkmulti-scale featuresboundary refinementCityscapes dataset
0
0 comments X

The pith

ELKPPNet achieves superior semantic segmentation on Cityscapes, CamVid, and NYUDv2 by pairing a balanced encoder-decoder with large kernel pyramid pooling and an edge-aware loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ELKPPNet as an end-to-end network to solve the problem of insufficient discriminative feature learning in semantic segmentation. It builds a balanced encoder-decoder to reduce gaps between multi-level features, adds a large kernel spatial pyramid pooling block that expands receptive fields for multi-scale fusion, and introduces an edge-aware loss that refines object boundaries straight from the segmentation output. A sympathetic reader would care because the combination targets two practical failures: missing small or large objects and confusing adjacent regions that look alike. Experiments on three standard benchmarks show the full model beats prior methods when conditions are matched.

Core claim

The central claim is that the ELKPPNet architecture, formed by a balanced encoder-decoder network, the LKPP block with densely expanding receptive field, and the new edge-aware loss applied directly to the prediction map, produces more robust and discriminative features that improve both multi-scale object detection and boundary accuracy.

What carries the argument

The large kernel spatial pyramid pooling (LKPP) block that creates a densely expanding receptive field for multi-scale feature extraction and fusion, together with the edge-aware loss that operates directly on the semantic segmentation prediction.

If this is right

  • Models can handle multi-scale objects more reliably in both urban driving scenes and indoor environments.
  • Adjacent objects with similar appearance become easier to separate without extra post-processing.
  • Semantic consistency inside single objects improves because boundary signals feed back into feature learning.
  • The same loss can be attached to other encoder-decoder backbones to gain boundary refinement without redesigning the whole network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The edge-aware loss could be tested as a plug-in module on existing state-of-the-art segmentation networks to measure isolated gains.
  • Large-kernel pyramid designs might transfer to other dense-prediction tasks such as depth estimation or surface normal prediction.
  • Evaluating the model on additional datasets like ADE20K would reveal whether the gains hold beyond the three reported benchmarks.

Load-bearing premise

That the edge-aware loss function refines boundaries directly from the semantic segmentation prediction to yield more robust and discriminative features.

What would settle it

If ELKPPNet fails to exceed the accuracy of the strongest competing methods on the Cityscapes validation set when trained and evaluated under identical conditions and protocols, the superiority claim would be falsified.

Figures

Figures reproduced from arXiv: 1906.11428 by Hanjiang Xiong, Jianya Gong, Linxi Huan, Xianwei Zheng.

Figure 8
Figure 8. Figure 8: Zoomed-in qualitative results. 4.4.3 Evaluation of Edge-aware Loss Function A further evaluation was also made for the proposed edge-aware cross-entropy loss function (also referred as ECE loss). Resnet-50 with the proposed balanced encoder-decoder framework was applied as the baseline network, and the two loss functions, i.e., CE loss and the proposed ECE loss, were first studied on the baseline network. … view at source ↗
read the original abstract

Semantic segmentation has been a hot topic across diverse research fields. Along with the success of deep convolutional neural networks, semantic segmentation has made great achievements and improvements, in terms of both urban scene parsing and indoor semantic segmentation. However, most of the state-of-the-art models are still faced with a challenge in discriminative feature learning, which limits the ability of a model to detect multi-scale objects and to guarantee semantic consistency inside one object or distinguish different adjacent objects with similar appearance. In this paper, a practical and efficient edge-aware neural network is presented for semantic segmentation. This end-to-end trainable engine consists of a new encoder-decoder network, a large kernel spatial pyramid pooling (LKPP) block, and an edge-aware loss function. The encoder-decoder network was designed as a balanced structure to narrow the semantic and resolution gaps in multi-level feature aggregation, while the LKPP block was constructed with a densely expanding receptive field for multi-scale feature extraction and fusion. Furthermore, the new powerful edge-aware loss function is proposed to refine the boundaries directly from the semantic segmentation prediction for more robust and discriminative features. The effectiveness of the proposed model was demonstrated using Cityscapes, CamVid, and NYUDv2 benchmark datasets. The performance of the two structures and the edge-aware loss function in ELKPPNet was validated on the Cityscapes dataset, while the complete ELKPPNet was evaluated on the CamVid and NYUDv2 datasets. A comparative analysis with the state-of-the-art methods under the same conditions confirmed the superiority of the proposed algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes ELKPPNet, an end-to-end trainable encoder-decoder network augmented with a large kernel pyramid pooling (LKPP) block and an edge-aware loss function, for semantic segmentation. It claims that the balanced encoder-decoder narrows semantic and resolution gaps, the LKPP provides densely expanding receptive fields for multi-scale fusion, and the edge-aware loss refines boundaries directly from predictions to yield more discriminative features, resulting in superior performance over state-of-the-art methods on the Cityscapes, CamVid, and NYUDv2 benchmarks under comparable conditions, with component ablations reported on Cityscapes.

Significance. If the reported gains hold, the work supplies a practical architecture combining multi-scale pooling with boundary-aware supervision that could aid urban scene parsing and indoor segmentation tasks. The provision of Cityscapes ablations plus cross-dataset evaluation on two additional benchmarks supplies external grounding for the central empirical claim. The explicit design of the LKPP block and the end-to-end formulation are concrete contributions that can be directly compared by subsequent work.

major comments (2)
  1. [§4.2] §4.2 (Cityscapes ablation table): the incremental mIoU gains attributed to the edge-aware loss are reported without standard deviations across multiple random seeds or statistical tests; this weakens the claim that the loss produces reliably more robust features, as the observed deltas could fall within run-to-run variance.
  2. [§3.3] §3.3 (edge-aware loss): the formulation is stated to refine boundaries 'directly from the semantic segmentation prediction,' yet the loss expression incorporates ground-truth edge maps; this mismatch between the textual claim and the actual supervision signal is load-bearing for the interpretation of how discriminative features are learned.
minor comments (3)
  1. [Figure 2] Figure 2 (LKPP block diagram): the kernel sizes and dilation rates inside the pyramid levels are not numerically annotated on the figure itself, forcing the reader to cross-reference the text.
  2. [Table 1] Table 1 (Cityscapes results): the column headers for 'Params' and 'FPS' are present but the corresponding values for the proposed model are omitted in one row, breaking direct efficiency comparison.
  3. [§5] §5 (NYUDv2 evaluation): the protocol states 'same conditions' as prior work, yet the exact training schedule, crop size, and data augmentation details are only summarized rather than tabulated against the cited baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. The two major comments are addressed point-by-point below with honest responses on what can be revised.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Cityscapes ablation table): the incremental mIoU gains attributed to the edge-aware loss are reported without standard deviations across multiple random seeds or statistical tests; this weakens the claim that the loss produces reliably more robust features, as the observed deltas could fall within run-to-run variance.

    Authors: We agree that the absence of standard deviations or statistical tests in the §4.2 ablation table limits the strength of claims about reliable improvements from the edge-aware loss. The reported results were obtained from single training runs, which was standard practice at the time given the high computational cost of Cityscapes experiments. In the revised manuscript we will add an explicit note acknowledging this limitation and the possibility that small deltas may lie within run-to-run variance; we will also report standard deviations for the key ablations if additional compute can be secured. revision: partial

  2. Referee: [§3.3] §3.3 (edge-aware loss): the formulation is stated to refine boundaries 'directly from the semantic segmentation prediction,' yet the loss expression incorporates ground-truth edge maps; this mismatch between the textual claim and the actual supervision signal is load-bearing for the interpretation of how discriminative features are learned.

    Authors: We thank the referee for identifying this inconsistency in §3.3. The edge-aware loss does use ground-truth edge maps (extracted from the semantic labels) together with the model's semantic segmentation prediction to supervise boundary refinement. The original wording was imprecise and overstated the degree to which refinement occurs solely from the prediction. We will revise the description in §3.3 to accurately state that the loss combines the prediction with GT edge maps, thereby clarifying how the supervision signal contributes to more discriminative features. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes an encoder-decoder architecture, LKPP block, and edge-aware loss, then reports empirical results on public benchmarks (Cityscapes, CamVid, NYUDv2) with SOTA comparisons under matched conditions. No equations, derivations, or self-citations are shown that reduce any claimed result to its inputs by construction; performance claims rest on external dataset evaluations rather than internal fitting or renaming.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard deep-learning assumptions about convolutional feature learning plus empirical tuning of new architectural elements on benchmark data.

free parameters (1)
  • kernel sizes and pyramid levels in LKPP
    Design choices for expanding receptive field, selected to achieve multi-scale fusion and likely optimized on validation splits.
axioms (1)
  • domain assumption Convolutional encoder-decoder networks augmented with multi-scale pooling and edge supervision can learn more discriminative features for semantic segmentation.
    Core premise invoked to justify the three components.

pith-pipeline@v0.9.0 · 5828 in / 1202 out tokens · 30267 ms · 2026-05-25T15:12:07.603232+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  1. [1]

    ELKPPNet: An Edge-aware Neural Network with Large Kernel Pyramid Pooling for Learning Discriminative Features in Semantic Segmentation Xianwei Zheng1,*, Linxi Huan1, Hanjiang Xiong1, Jianya Gong1,2 1The State Key Laboratory of Information Engineering in Su rveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China 2School of Remote Sensing and In...

  2. [2]

    2017; Zhang et al

    , urban 3D semantic modeling (Rouhani et al. 2017; Zhang et al

  3. [3]

    2018; Politz and Sester

    and remote sensing image classification (Kemker et al. 2018; Politz and Sester

  4. [4]

    Since then, deep convolutional neural networks (DCNNs) have enabled semantic segmentation algorithms to achieve remarkable progress in RGB scenes

    Semantic segmentation developed slowly because the various objects found in an image limit the efficiency of automatic scene parsing, until the popularization of deep learning. Since then, deep convolutional neural networks (DCNNs) have enabled semantic segmentation algorithms to achieve remarkable progress in RGB scenes. As DCNNs have the ability to lear...

  5. [5]

    methods based on image pyramids (Zhao et al. 2018)

  6. [6]

    2017); and

    methods applying an encoder -decoder structure (Badrinarayanan et al. 2017); and

  7. [7]

    2017; Chen et al

    methods deploying spatial pyramid pooling (SPP) (Zhao et al. 2017; Chen et al

  8. [8]

    Existing methods for multi-scale context extraction (Chen et al. 2017). The other difficulty for precise semantic segmentation lies in detail refinement. Most deep learning methods are not sensitive to detail information, and thus they often cannot maintain semantic consistency inside a single object (intra-class inconsistency) or distinguish two semantic...

  9. [9]

    gridding

    and Zhou et al. (2018) . This approach also refines the semantic boundary for prediction using geometrical information from the low -level features. The SPP module, i.e., LKPP, is constructed with large kernels with hybrid asymmetric dilated convolutions to overcome the limitations of the existing SPP modules. The LKPP module can encode rich spatial infor...

  10. [10]

    Many researches have focused on enhancing the robustness to scale variance by view field enlargement and effective multi -level feature fusion

    2 Related work 2.1 Multi-scale Object Detection Scale variance of objects occurs frequently in natural and remote sensing images, and influences the learning ability of deep networks for semantic segmentation. Many researches have focused on enhancing the robustness to scale variance by view field enlargement and effective multi -level feature fusion. The...

  11. [11]

    to model region similarities (Zheng 2015 ; Li 2016; Chen 2016), and some adopted several sequential convolutional layers to extract long-range information (Yu and Koltun 2016; Liu et al. 2015). DenseASPP involves organizing atrous convolutional layers with increasing rates in a dense fashion to enlarge receptive filed size (Yang et al

  12. [12]

    Structures such as DenseASPP, in particular, suffer from high computational cost and optimization trouble coming from the dense connection and concatenation

    However, in practice, the extra subnetwork brings heavy computational complexity and a high memory footprint. Structures such as DenseASPP, in particular, suffer from high computational cost and optimization trouble coming from the dense connection and concatenation. The encoder-decoder framework achieves multi-level feature aggregation by merging low-lev...

  13. [13]

    (2016) and Li et al

    Jegou et al. (2016) and Li et al. (2019) constructed dense multi -scale connections for fe ature aggregation, and Yu et al. (2018) hierarchically fused multi-level features by deep layer aggregation. However, these methods often need well -designed aggregation structures, which require prior knowledge and introduce a large number of parameters, which come...

  14. [14]

    gridding

    the “gridding” problem, which happens when the view field is enlarged by dilated convolutional layers (Wang et al. 2018). In the proposed network, the balanced encoder -decoder framework is capable of efficient and computation-saving multi -level feature aggregation, and the novel spatial pyramid pooling module — LKPP—can obtain highly rich contextual fea...

  15. [15]

    intra-class inconsistency

    , while Yu et al. (2018) combined semantic segmentation and boundary detection by two subnetworks —Smooth Network and Border Network —to address the “intra-class inconsistency” issue and enlarge the “inter-class distinction”. Jiang et al. (2017), Lee et al. (2017) and Marmanis et al. (2018) extracted edge features from DEM data or a depth map. However, te...

  16. [16]

    3.1 The Workflow of ELKPPNet The proposed ELKPPNet framework, as illustrated in Fig

    The whole network architecture of the proposed ELKPPNet. 3.1 The Workflow of ELKPPNet The proposed ELKPPNet framework, as illustrated in Fig. 2, features a residual network as an encoder, and a decoder followed by a classifier layer and an edge extractor. ELKPPNet takes an RGB image as input, and outputs a semantic segmentation prediction at the classifie...

  17. [17]

    It can be seen that gridding effect produces checkboard-like patterns, while the proposed LKPP effectively eliminate such an effect

    with the proposed LKPP module. It can be seen that gridding effect produces checkboard-like patterns, while the proposed LKPP effectively eliminate such an effect. As demonstrated in Chen et al. (2017), the larger the dilation rate grows, the small the number of effective kernel weights will become. For example, if the filter size is close to the feature ...

  18. [18]

    gridding

    , but such a solution also causes the problem known as “gridding” (Wang et al. 2018). Taking 3k  and 2r  for illustration, if a group of sequential convolutional layers have the same rate r , then given an arbitrary pixel p of the top layer il , its receptive field is formed in a checkerboard fashion, meaning much of the information from the input is di...

  19. [19]

    gridding

    is a solution to address the ‘gridding’ issue. (a) (b) (c) (d) Layer3, rate=2 Layer2, rate=2 Layer1, rate=2 Layer3, rate=3 Layer2, rate=2 Layer1, rate=1 Rate=1 Rate=12Rate=3 Given N convolutional layers  1,..., Nll with kernels of size kk chained in cascade, and  1,..., Nrr denote their dilation rates, we can define the maximum distance between nonze...

  20. [20]

    gridding

    The two-layer convolution in an HADC block. (a) Large Kernel Pyramid Pooling. (b) The HADC block in parallel LKPP. (c) The HADC block in cascade LKPP. Cascade LKPP: In cascade LKPP, each HADC branch consists of three two-layer pairs, and the layers in each pair are joi ned sequentially, which can greatly expand receptive field size, and therefore is appli...

  21. [21]

    intra-class inconsistency

    Edge Extractor. (a) Edge map under different k . Left: Edge map with =1k ; Right: Edge map with =3k . (b) The mechanism of edge extractor. Edge detection is a binary classification problem, but the gradient map only contains semantic edge information an d optimizing semantic edge, may introduce unnecessary computation and require more GPU memory, as in Li...

  22. [22]

    2015; Cordts et al

    network was selected as the backbone for all the m odels, and the experiments were conducted on three challen ging semantic segmentation datasets: the Cityscapes (Cordts et al. 2015; Cordts et al. 2016), and CamVid (Fauqueur et al. 2007; Badrinarayanan et al

  23. [23]

    2012; Lee et al

    outdoor datasets and the NYUDv2 indoor scene parsing benchmark dataset (Silberman et al. 2012; Lee et al. 2017). Ablation studies were first conducted on the Cityscapes dataset to validate the proposed balanced encoder-decoder structure, the LKPP module, and the ECE loss function, respectively. To allow a comprehensive evaluation, the whole ELKPPNet was f...

  24. [24]

    and PSPNet (Zhao et al. 2017). In all the experiments, except for mirror flip, no extra training tricks were used, especially those related to detail augmentation and multi-scale detection, because other training tricks add more random information, making it difficult to determine whether the discriminative feature learning ability is boosted by the given...

  25. [25]

    Prediction results of U-Net and the balanced encoder-decoder. In (c), the U-Net structure yields droplet-like over-smoothed patches, which even erase the corners of the traffic sign (yellow) and distort its square shape into a nearly round one. In (d), the proposed balanced encoder-decoder framework more precisely draws out the contours of trees, pedestri...

  26. [26]

    gridding

    In DilatedNet, the final upsampling operation helps to remove the visible “gridding” problem in the final output. The dilation rates of ASPP and DenseASPP were the same as Chen et al. (2018) (i.e., ASPP module with dilation rate of 1, 6, 12,

  27. [27]

    gridding

    and Yang et al. (2018) (i.e., DenseASPP with dilation rate of 3, 6, 12, 18, 24). The kernels used in the LKPP module were set to 3×3, 3×5 (5×3), and 3×7 (7×3), and the rates in every HADC of the LKPP module were set as 1,2,3, to avoid superfluous invalid information caused by zero values introduced by large di lations. The baseline was a ResNet -50 networ...

  28. [28]

    Quantitive analysis on 37-class NYUDv2 dataset (unit: %). Metric Model mIoU FWIoU PixelAcc MeanClassAcc Deeplabv3 28.51 48.32 64.27 34.48 Deeplabv3+ 29.30 50.09 65.69 35.03 DenseASPP 30.77 50.53 67.13 35.36 PSPNet 24.11 45.75 61.18 29.93 RefineNet 29.40 50.79 66.92 34.43 Our ELKPPNet (parallel) 34.41 55.11 70.03 39.00 Test Image Ground Truth DeepLabV3 Dee...

  29. [29]

    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

    Tensorflow: Large- scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. https://arxiv.org/abs/1603.04467 Badrinarayanan V ., Kendall A., Cipolla R.,

  30. [30]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. https://arxiv.org/abs/1706.05587 Cordts M., Omran M., Ramos S., Rehfeld T., Enzweiler M., Benenson R., Franke U., Roth S. and Schiele B.,

  31. [31]

    IEEE International Conference on Computer Vision

    Predicting Depth, Surface Normals and Semantic La bels with a Common Multi-scale Convolutional Architecture. IEEE International Conference on Computer Vision. Farabet C., Couprie C., Najman L., Lecun Y ., 2013, Learning Hierarchical Features for Scene Labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915-1929. Fauqueur J., B...

  32. [32]

    2007 IEEE International Conference on Computer Vision, 1-7, IEEE

    Assisted video object labeling by joint tracking of regions and keypoints. 2007 IEEE International Conference on Computer Vision, 1-7, IEEE. Gonzalez R. and Woods R.,

  33. [33]

    IEEE Transactions on Pattern Analysis & Machine Intelligence 37(9):1904-

    Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence 37(9):1904-

  34. [34]

    2017 IEEE International Conference on Software Engine ering and Service Science

    Incorporating depth into both CNN and CRF for indoor semantic segmentation. 2017 IEEE International Conference on Software Engine ering and Service Science. Kemker R., Salvaggio C. and Kanan C.,

  35. [35]

    2017 IEEE International Conference on Computer Vision

    RDFNet: RGB -D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation. 2017 IEEE International Conference on Computer Vision. Li H., Xiong P., Fan H. and Sun J.,

  36. [36]

    DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation

    DFANet: Deep Feature Aggregation for Real -Time Semantic Segmentation. arXiv preprint arXiv:1904.02216. https://arxiv.org/abs/1904.02216 Li W. and Yang M.,

  37. [37]

    2017 IEEE Conference on Computer Vision and Pattern Recognition

    RefineNet: Multi -path Refinement Networks for High - Resolution Semantic Segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition. Lin T., Goyal P., Girshick R., He K. and Piotr D.,

  38. [38]

    arXiv preprint arXiv:1804.02864

    Semantic edge detection with diverse deep supervision. arXiv preprint arXiv:1804.02864. https://arxiv.org/abs/1804.02864 Liu Y ., Cheng M., Hu X., Wang K. and Bai X.,

  39. [39]

    International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences-ISPRS Archives 42 (2018), Nr

    Exploring ALS and DIM data for semantic segmentation using CNNs. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences-ISPRS Archives 42 (2018), Nr. 1 42(1): 347-354. Ronneberger O., Fischer P. and Brox T.,

  40. [40]

    2018 IEEE Winter Conference on Applications of Computer Vision (pp

    Understanding convolution for semantic segmentation. 2018 IEEE Winter Conference on Applications of Computer Vision (pp. 1451-1460) Xiao J., Owens A. and Torralba A.,