
arxiv: 1907.03241 · v1 · pith:YZOSTE7Xnew · submitted 2019-07-07 · 💻 cs.CV

ASCNet: Adaptive-Scale Convolutional Neural Networks for Multi-Scale Feature Learning

Pith reviewed 2026-05-25 01:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords adaptive dilation · semantic segmentation · multi-scale features · medical image segmentation · dilated convolution · convolutional neural networks · per-pixel rates

The pith

ASCNet learns a dilation rate for each pixel via a 3-layer structure to fit receptive fields to objects of varying sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the fixed receptive fields of standard dilated convolutions, which cannot adapt to objects of different sizes in an image. It introduces ASCNet, an end-to-end network that inserts a 3-layer convolution module to generate a unique dilation rate per pixel. This produces scale-appropriate receptive fields without the cost of larger kernels or the loss from pooling. Tests on the Herlev and SCD RBC medical image datasets show ASCNet reaching the highest segmentation accuracy among the three compared models (classic CNN, fixed-rate dilated CNN, and ASCNet), with the generated rates correlating positively with object sizes.

Core claim

By adding a 3-layer convolution structure to the network, ASCNet learns per-pixel dilation rates during training that create optimal receptive fields matched to each object's size, enabling more accurate semantic segmentation than either classic CNNs or fixed-rate dilated CNNs on the tested medical datasets.

What carries the argument

The 3-layer convolution structure that outputs per-pixel adaptive dilation rates.
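To make the mechanism concrete, here is a minimal NumPy sketch of a 3x3 convolution whose dilation rate differs at every pixel. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how fractional rates are sampled, so bilinear interpolation, zero padding outside the image, the single-channel input, and the names `bilinear_sample` and `adaptive_dilated_conv3x3` are all assumptions. The `rates` array stands in for the output of the 3-layer module, which is not modelled here.

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample a 2D array at fractional (y, x); points outside the image read as 0."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    wy, wx = y - y0, x - x0
    value = 0.0
    # Weighted sum over the four integer neighbours of (y, x).
    for dy, dx, weight in ((0, 0, (1 - wy) * (1 - wx)), (0, 1, (1 - wy) * wx),
                           (1, 0, wy * (1 - wx)), (1, 1, wy * wx)):
        yy, xx = y0 + dy, x0 + dx
        if 0 <= yy < h and 0 <= xx < w:
            value += weight * img[yy, xx]
    return value

def adaptive_dilated_conv3x3(img, kernel, rates):
    """3x3 cross-correlation where pixel (i, j) uses its own dilation rate rates[i, j].

    Assumption: `rates` is a per-pixel map supplied by the caller; in ASCNet it
    would come from the learned 3-layer module.
    """
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            r = rates[i, j]
            acc = 0.0
            for a in (-1, 0, 1):
                for b in (-1, 0, 1):
                    # Kernel tap (a, b) reads the input r pixels away per step.
                    acc += kernel[a + 1, b + 1] * bilinear_sample(img, i + a * r, j + b * r)
            out[i, j] = acc
    return out
```

The explicit loops favour readability over speed. A trainable version would need a differentiable sampling step so that gradients can reach the rate map, which is one reason interpolation is a natural assumption here.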

If this is right

  • ASCNet achieves the highest segmentation accuracy among the three compared models on the Herlev and SCD RBC datasets.
  • The automatically generated dilation rates increase with object size.
  • The approach extracts multi-scale information without fixed rates or extra computational cost from kernel expansion.
  • Pixel-level rates avoid the information loss that comes from maximum pooling.
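The cost argument in the points above rests on standard dilated-convolution arithmetic: a k x k kernel at dilation rate r spans k + (k - 1)(r - 1) pixels per side while keeping k² weights. The short sketch below works through that formula; the function name `dilated_extent` is hypothetical and the rates shown are illustrative, not values reported by the paper.

```python
def dilated_extent(kernel_size: int, rate: int) -> int:
    """Side length in pixels spanned by a square kernel at the given dilation rate."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

for rate in (1, 2, 4):
    side = dilated_extent(3, rate)
    print(f"rate={rate}: spans {side}x{side} pixels with 9 weights; "
          f"a dense kernel of that size would need {side * side}")
```

For a 3x3 kernel, rates 1, 2, and 4 give spans of 3, 5, and 9 pixels, so the weight count stays at 9 where a dense kernel would need 9, 25, and 81.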

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-pixel adaptation idea could be tested on natural-image datasets that contain large scale variation.
  • If the learned rates prove stable across domains, the method might reduce the need for explicit multi-scale modules such as feature pyramids.
  • One could test whether the 3-layer module adds measurable latency at inference time on the same hardware used for the original experiments.

Load-bearing premise

The premise is that the 3-layer structure can learn per-pixel dilation rates that generalize beyond the training set, and that the positive correlation with object size demonstrates the rates are effective rather than an artifact.

What would settle it

If additional datasets show no accuracy gain over fixed-rate dilated networks or no positive correlation between learned rates and measured object sizes, the central claim would be challenged.
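One way such a check could be run is sketched below: average the learned rate inside each object and correlate it with the object's pixel area. This is a sketch under assumptions, not the paper's evaluation code. It assumes a per-pixel `rate_map` exported from the network and an integer `instance_labels` map in which 0 is background and each object has its own id; both names are hypothetical, and Pearson correlation is used only because the abstract does not name a statistic.

```python
import numpy as np

def rate_size_correlation(rate_map, instance_labels):
    """Pearson correlation between object area and mean learned rate per object.

    Assumptions: label 0 is background, every other label is one object, and
    at least two objects with differing areas and rates are present.
    """
    ids = [k for k in np.unique(instance_labels) if k != 0]
    areas = np.array([(instance_labels == k).sum() for k in ids], dtype=float)
    mean_rates = np.array([rate_map[instance_labels == k].mean() for k in ids])
    return np.corrcoef(areas, mean_rates)[0, 1]
```

A value near zero or below zero on new datasets would count against the claim. A rank-based statistic would be a reasonable alternative if the relation between size and rate is monotonic but not linear.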

Figures

Figures reproduced from arXiv: 1907.03241 by Jie Zhao, Li Zhang, Mo Zhang, Quanzheng Li, Xiang Li.

Figure 1: The sampling process of ASC module. In the left panel, the colored dots […]
Figure 2: Architecture of the adaptive-scale convolutional neural network. Note […]
Figure 3: Two examples of the segmentation results of the different models on the […]
Figure 4: Visualization of the learned dilation rate. (a) Examples of the segmenta[…]
Original abstract

Extracting multi-scale information is key to semantic segmentation. However, the classic convolutional neural networks (CNNs) encounter difficulties in achieving multi-scale information extraction: expanding convolutional kernel incurs the high computational cost and using maximum pooling sacrifices image information. The recently developed dilated convolution solves these problems, but with the limitation that the dilation rates are fixed and therefore the receptive field cannot fit for all objects with different sizes in the image. We propose an adaptive-scale convolutional neural network (ASCNet), which introduces a 3-layer convolution structure in the end-to-end training, to adaptively learn an appropriate dilation rate for each pixel in the image. Such pixel-level dilation rates produce optimal receptive fields so that the information of objects with different sizes can be extracted at the corresponding scale. We compare the segmentation results using the classic CNN, the dilated CNN and the proposed ASCNet on two types of medical images (The Herlev dataset and SCD RBC dataset). The experimental results show that ASCNet achieves the highest accuracy. Moreover, the automatically generated dilation rates are positively correlated to the sizes of the objects, confirming the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes ASCNet, which augments standard CNNs for semantic segmentation with a 3-layer convolution module that learns per-pixel dilation rates in an end-to-end fashion. This is intended to produce object-size-appropriate receptive fields without the cost of large kernels or the information loss of pooling. Experiments on the Herlev and SCD RBC medical-image datasets are reported to show that ASCNet attains the highest accuracy among compared methods (classic CNN, fixed-dilation CNN) and that the learned dilation rates correlate positively with object size, which the authors interpret as confirmation of the method's effectiveness.

Significance. If the per-pixel adaptation demonstrably improves segmentation beyond fixed-dilation baselines on varied object scales, the approach would offer a practical route to multi-scale feature extraction in domains such as medical imaging. The absence of reported quantitative metrics, ablations, or statistical tests in the provided abstract, however, leaves the magnitude and robustness of any gain unclear.

major comments (2)
  1. [Abstract] The claim that 'ASCNet achieves the highest accuracy' is presented without any numerical values, standard deviations, or statistical tests, preventing assessment of whether the improvement is practically meaningful or merely within noise.
  2. [Abstract] The assertion that the positive correlation between learned dilation rates and object sizes 'confirm[s] the effectiveness of the proposed method' is load-bearing for the central contribution, yet no ablation (fixed-rate baseline, shuffled-size control, or out-of-distribution size test) is described; without such controls the correlation could arise from the network regressing to dataset-wide scale statistics rather than from useful per-pixel adaptation at test time.
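The kind of statistical test the first comment asks for could take the form of a paired sign-flip permutation test on per-image scores, sketched below. This is an illustrative choice and not something the paper or the referee specifies. It assumes both models were scored on the same test images in the same order; the function name and the `scores_a` / `scores_b` arguments are hypothetical.

```python
import numpy as np

def paired_sign_flip_pvalue(scores_a, scores_b, n_perm=10000, seed=0):
    """One-sided p-value for mean(scores_a - scores_b) > 0 under random sign flips.

    Assumption: scores_a[i] and scores_b[i] were measured on the same image i.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Under the null hypothesis the sign of each paired difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null_means = (signs * diffs).mean(axis=1)
    # Add-one correction keeps the estimate strictly positive.
    return (np.sum(null_means >= observed) + 1) / (n_perm + 1)
```

A small p-value would indicate that the accuracy gap is unlikely to be noise across test images. It would not address the second comment, which needs its own controls.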

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the abstract. We will revise the abstract to include quantitative results from the experiments. For the correlation claim, we note that the existing fixed-dilation baseline already provides relevant evidence, though we acknowledge the value of additional controls.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'ASCNet achieves the highest accuracy' is presented without any numerical values, standard deviations, or statistical tests, preventing assessment of whether the improvement is practically meaningful or merely within noise.

    Authors: We agree that the abstract would benefit from quantitative support. The revised abstract will report the specific segmentation accuracy metrics (e.g., Dice or IoU scores) for ASCNet versus the classic CNN and fixed-dilation CNN on both the Herlev and SCD RBC datasets. Where multiple runs exist in the full experiments, standard deviations will be added.
    Revision: yes

  2. Referee: [Abstract] The assertion that the positive correlation between learned dilation rates and object sizes 'confirm[s] the effectiveness of the proposed method' is load-bearing for the central contribution, yet no ablation (fixed-rate baseline, shuffled-size control, or out-of-distribution size test) is described; without such controls the correlation could arise from the network regressing to dataset-wide scale statistics rather than from useful per-pixel adaptation at test time.

    Authors: The manuscript already compares ASCNet against a fixed-dilation CNN, which serves as the fixed-rate baseline and demonstrates performance gains from adaptation. The positive correlation is shown in the results section as corroborating evidence. Shuffled-size controls and out-of-distribution size tests were not performed; the current fixed-baseline comparison and end-to-end training provide the primary support that adaptation is task-relevant rather than purely statistical. We will clarify this distinction in the revised text.
    Revision: partial
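For reference, the Dice and IoU scores named in the first response can be computed from binary masks as in the sketch below. It uses the standard definitions; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something the paper states.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice coefficient and intersection-over-union for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention (an assumption): two empty masks count as a perfect match.
    dice = 2.0 * intersection / total if total > 0 else 1.0
    iou = intersection / union if union > 0 else 1.0
    return float(dice), float(iou)
```

Computing these per image, rather than once over the pooled test set, is what makes standard deviations and paired tests across models possible.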

Circularity Check

0 steps flagged

No circularity; empirical results do not reduce to inputs by construction

Full rationale

The paper introduces a 3-layer convolution module to predict per-pixel dilation rates, trains the network end-to-end on segmentation tasks, and reports higher accuracy plus a post-training correlation between learned rates and object sizes. No equations, uniqueness theorems, or self-citations are invoked to derive the accuracy gain or the correlation; both are presented as measured outcomes on held-out test images. The correlation is offered as supporting observation rather than a quantity that is definitionally forced by the fitting procedure, and no load-bearing claim collapses to a renamed input or self-referential ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The learned dilation rates are treated as outputs of the network rather than hand-specified constants.


