ASCNet: Adaptive-Scale Convolutional Neural Networks for Multi-Scale Feature Learning

Jie Zhao; Li Zhang; Mo Zhang; Quanzheng Li; Xiang Li

arxiv: 1907.03241 · v1 · pith:YZOSTE7Xnew · submitted 2019-07-07 · 💻 cs.CV

Mo Zhang, Jie Zhao, Xiang Li, Li Zhang, Quanzheng Li

Pith reviewed 2026-05-25 01:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords adaptive dilation · semantic segmentation · multi-scale features · medical image segmentation · dilated convolution · convolutional neural networks · per-pixel rates

The pith

ASCNet learns a dilation rate for each pixel via a 3-layer structure to fit receptive fields to objects of varying sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the fixed receptive fields of standard dilated convolutions, which cannot adapt to objects of different sizes in an image. It introduces ASCNet, an end-to-end network that inserts a 3-layer convolution module to generate a unique dilation rate per pixel. This produces scale-appropriate receptive fields without the cost of larger kernels or the loss from pooling. Tests on the Herlev and SCD RBC medical image datasets show ASCNet reaching the highest segmentation accuracy, with the generated rates correlating positively to object sizes.

Core claim

By adding a 3-layer convolution structure to the network, ASCNet learns per-pixel dilation rates during training that create optimal receptive fields matched to each object's size, enabling more accurate semantic segmentation than either classic CNNs or fixed-rate dilated CNNs on the tested medical datasets.

What carries the argument

The 3-layer convolution structure that outputs per-pixel adaptive dilation rates.
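
To make the mechanism concrete, here is a minimal single-channel sketch in plain Python of a 3x3 convolution whose dilation rate is read per pixel from a rate map. It is not the authors' implementation. The abstract does not say how fractional rates are sampled, so bilinear interpolation is an assumption here; the names `bilinear` and `adaptive_dilated_conv3x3` are hypothetical; and the rate map is passed in directly rather than predicted by the 3-layer module.

```python
import math


def bilinear(img, y, x):
    """Sample img at real-valued (y, x); positions outside the image read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0
    total = 0.0
    # Blend the four integer neighbours, weighted by proximity.
    for yy, wy in ((y0, 1.0 - dy), (y0 + 1, dy)):
        for xx, wx in ((x0, 1.0 - dx), (x0 + 1, dx)):
            if 0 <= yy < h and 0 <= xx < w:
                total += wy * wx * img[yy][xx]
    return total


def adaptive_dilated_conv3x3(img, kernel, rates):
    """3x3 convolution where pixel (i, j) uses dilation rate rates[i][j].

    Assumption: fractional rates are handled by bilinear sampling.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            r = rates[i][j]
            acc = 0.0
            for a in (-1, 0, 1):
                for b in (-1, 0, 1):
                    # Kernel tap (a, b) is pushed out to offset (a*r, b*r).
                    acc += kernel[a + 1][b + 1] * bilinear(img, i + a * r, j + b * r)
            out[i][j] = acc
    return out


# Toy input: a horizontal ramp and a horizontal difference kernel.
img = [[float(j) for j in range(5)] for _ in range(5)]
kernel = [[0, 0, 0], [-1, 0, 1], [0, 0, 0]]
rates = [[1.0] * 5 for _ in range(5)]
rates[2][2] = 1.5  # only the centre pixel uses a wider receptive field
out = adaptive_dilated_conv3x3(img, kernel, rates)
```

In an end-to-end network, `rates` would presumably be the output of the rate-predicting module, and an interpolating sampler is one way to keep the result differentiable with respect to those rates.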

If this is right

ASCNet achieves the highest segmentation accuracy on the Herlev and SCD RBC datasets.
The automatically generated dilation rates increase with object size.
The approach extracts multi-scale information without fixed rates or extra computational cost from kernel expansion.
Pixel-level rates avoid the information loss that comes from maximum pooling.
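
The cost argument in the list above rests on a standard property of dilated convolution: a k x k kernel with integer dilation rate r spans k + (k - 1)(r - 1) pixels per side while keeping k² weights. The short sketch below works through that arithmetic for a single layer. The function names are hypothetical and the formula is the textbook one, not a figure taken from the paper.

```python
def effective_kernel_size(k, rate):
    """Side length spanned by a k x k kernel with integer dilation `rate`."""
    return k + (k - 1) * (rate - 1)


def weight_count(k):
    """Number of weights in a k x k kernel; dilation does not change it."""
    return k * k


# Columns: rate, spanned side, dilated weights, weights of a dense kernel of that side.
for rate in (1, 2, 4):
    side = effective_kernel_size(3, rate)
    print(rate, side, weight_count(3), weight_count(side))
# prints:
# 1 3 9 9
# 2 5 9 25
# 4 9 9 81
```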

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-pixel adaptation idea could be tested on natural-image datasets that contain large scale variation.
If the learned rates prove stable across domains, the method might reduce the need for explicit multi-scale modules such as feature pyramids.
One could measure whether the 3-layer module adds measurable latency at inference time on the same hardware used for the original experiments.

Load-bearing premise

The 3-layer structure can learn per-pixel dilation rates that generalize beyond the training set and that the positive correlation with object size demonstrates the rates are effective rather than an artifact.

What would settle it

If additional datasets show no accuracy gain over fixed-rate dilated networks or no positive correlation between learned rates and measured object sizes, the central claim would be challenged.
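
One way to run the correlation half of that check is sketched below. It assumes two inputs the abstract does not specify: an instance label map (0 for background, positive integers for objects) and a per-pixel rate map of the same shape. It uses pixel area as the size measure, which is also an assumption. `per_object_stats` and `pearson` are hypothetical helper names, and the arrays are toy values chosen only to exercise the code.

```python
import math


def per_object_stats(labels, rates):
    """Return {object_id: (area_in_pixels, mean_rate)} for object ids > 0."""
    sums = {}
    for lab_row, rate_row in zip(labels, rates):
        for lab, r in zip(lab_row, rate_row):
            if lab > 0:
                area, total = sums.get(lab, (0, 0.0))
                sums[lab] = (area + 1, total + r)
    return {k: (area, total / area) for k, (area, total) in sums.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


# Toy maps (not data from the paper): three objects of area 1, 2 and 3.
labels = [[1, 2, 2], [3, 3, 3]]
rates = [[1.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
stats = per_object_stats(labels, rates)
areas = [stats[k][0] for k in sorted(stats)]
mean_rates = [stats[k][1] for k in sorted(stats)]
r = pearson(areas, mean_rates)
```

A coefficient near zero or below zero on held-out images from a new dataset would count against the claim. A positive one would still need the controls discussed in the referee report before it could be read as evidence of useful adaptation.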

Figures

Figures reproduced from arXiv: 1907.03241 by Jie Zhao, Li Zhang, Mo Zhang, Quanzheng Li, Xiang Li.

**Figure 2.** Architecture of the adaptive-scale convolutional neural network. Note …

**Figure 3.** Two examples of the segmentation results of the different models on the …

**Figure 4.** Visualization of the learned dilation rate. (a) Examples of the segmenta…

Original abstract

Extracting multi-scale information is key to semantic segmentation. However, the classic convolutional neural networks (CNNs) encounter difficulties in achieving multi-scale information extraction: expanding convolutional kernel incurs the high computational cost and using maximum pooling sacrifices image information. The recently developed dilated convolution solves these problems, but with the limitation that the dilation rates are fixed and therefore the receptive field cannot fit for all objects with different sizes in the image. We propose an adaptive-scale convolutional neural network (ASCNet), which introduces a 3-layer convolution structure in the end-to-end training, to adaptively learn an appropriate dilation rate for each pixel in the image. Such pixel-level dilation rates produce optimal receptive fields so that the information of objects with different sizes can be extracted at the corresponding scale. We compare the segmentation results using the classic CNN, the dilated CNN and the proposed ASCNet on two types of medical images (the Herlev dataset and SCD RBC dataset). The experimental results show that ASCNet achieves the highest accuracy. Moreover, the automatically generated dilation rates are positively correlated to the sizes of the objects, confirming the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ASCNet learns per-pixel dilation rates with a small auxiliary net, but the abstract gives no numbers or controls so the correlation claim is hard to evaluate.

The new piece is the 3-layer convolution module that predicts a dilation rate for every pixel and plugs it into the main segmentation network during end-to-end training. That is mechanically different from fixed-rate dilated convs or from attention-style scale selection in earlier work. On the two medical datasets mentioned, the method reportedly beats both plain CNNs and standard dilated CNNs while producing rates that line up with object size. That is the extent of the positive evidence in the abstract.

The soft spots are straightforward. No accuracy numbers, no standard deviations, no ablation that holds the rest of the architecture fixed and swaps learned rates for fixed ones, and no test that checks whether the rates actually change usefully on new images rather than just echoing training-set statistics. The positive correlation is offered as proof the adaptation works, yet nothing rules out the auxiliary net simply regressing to typical object scales in the training distribution. Without those controls the causal claim stays unsecured.

This is the kind of incremental architecture tweak that matters most to people already working on multi-scale medical segmentation. A reader who wants to see whether the per-pixel mechanism delivers measurable gains beyond a well-tuned fixed-dilation baseline would get value from the full paper. The idea is coherent enough on its own terms to deserve referee time if the manuscript supplies the missing quantitative checks and implementation details.

Referee Report

2 major / 0 minor

Summary. The paper proposes ASCNet, which augments standard CNNs for semantic segmentation with a 3-layer convolution module that learns per-pixel dilation rates in an end-to-end fashion. This is intended to produce object-size-appropriate receptive fields without the cost of large kernels or the information loss of pooling. Experiments on the Herlev and SCD RBC medical-image datasets are reported to show that ASCNet attains the highest accuracy among compared methods (classic CNN, fixed-dilation CNN) and that the learned dilation rates correlate positively with object size, which the authors interpret as confirmation of the method's effectiveness.

Significance. If the per-pixel adaptation demonstrably improves segmentation beyond fixed-dilation baselines on varied object scales, the approach would offer a practical route to multi-scale feature extraction in domains such as medical imaging. The absence of reported quantitative metrics, ablations, or statistical tests in the provided abstract, however, leaves the magnitude and robustness of any gain unclear.

major comments (2)

[Abstract] The claim that 'ASCNet achieves the highest accuracy' is presented without any numerical values, standard deviations, or statistical tests, preventing assessment of whether the improvement is practically meaningful or merely within noise.

[Abstract] The assertion that the positive correlation between learned dilation rates and object sizes 'confirm[s] the effectiveness of the proposed method' is load-bearing for the central contribution, yet no ablation (fixed-rate baseline, shuffled-size control, or out-of-distribution size test) is described; without such controls the correlation could arise from the network regressing to dataset-wide scale statistics rather than from useful per-pixel adaptation at test time.
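
One possible reading of the "shuffled-size control" is a permutation test: shuffle the pairing between objects and their mean learned rates many times and ask how often chance alone produces an association at least as strong as the observed one. The sketch below takes that reading, which is an assumption about what the referee intends. `permutation_p_value` is a hypothetical name, and the inputs are per-object sizes and mean rates supplied by the caller.

```python
import random


def permutation_p_value(sizes, rates, n_perm=2000, seed=0):
    """One-sided permutation p-value for a positive size/rate association.

    The statistic is the sum of products. Shuffling leaves both means and
    both variances unchanged, so this orders permutations exactly as the
    Pearson coefficient would.
    """
    if len(sizes) != len(rates) or len(sizes) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rng = random.Random(seed)

    def stat(ys):
        return sum(x * y for x, y in zip(sizes, ys))

    observed = stat(rates)
    shuffled = list(rates)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed pairing itself, so p is never zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value only shows that rates and sizes are associated within the evaluated set. It does not by itself separate per-pixel adaptation from regression to dataset-wide scale statistics, which is why the referee also asks for an out-of-distribution size test.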

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the abstract. We will revise the abstract to include quantitative results from the experiments. For the correlation claim, we note that the existing fixed-dilation baseline already provides relevant evidence, though we acknowledge the value of additional controls.

Referee: [Abstract] The claim that 'ASCNet achieves the highest accuracy' is presented without any numerical values, standard deviations, or statistical tests, preventing assessment of whether the improvement is practically meaningful or merely within noise.

Authors: We agree that the abstract would benefit from quantitative support. The revised abstract will report the specific segmentation accuracy metrics (e.g., Dice or IoU scores) for ASCNet versus the classic CNN and fixed-dilation CNN on both the Herlev and SCD RBC datasets. Where multiple runs exist in the full experiments, standard deviations will be added.
revision: yes

Referee: [Abstract] The assertion that the positive correlation between learned dilation rates and object sizes 'confirm[s] the effectiveness of the proposed method' is load-bearing for the central contribution, yet no ablation (fixed-rate baseline, shuffled-size control, or out-of-distribution size test) is described; without such controls the correlation could arise from the network regressing to dataset-wide scale statistics rather than from useful per-pixel adaptation at test time.

Authors: The manuscript already compares ASCNet against a fixed-dilation CNN, which serves as the fixed-rate baseline and demonstrates performance gains from adaptation. The positive correlation is shown in the results section as corroborating evidence. Shuffled-size controls and out-of-distribution size tests were not performed; the current fixed-baseline comparison and end-to-end training provide the primary support that adaptation is task-relevant rather than purely statistical. We will clarify this distinction in the revised text.
revision: partial
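
For reference, the Dice and IoU scores named in the first response can be computed from a predicted and a ground-truth binary mask as in the sketch below. This is a generic illustration and not the paper's evaluation code. `dice_and_iou` is a hypothetical name, masks are assumed to be equal-shape nested lists of 0/1, and scoring two empty masks as perfect agreement is a convention chosen here.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1  # predicted foreground, truly foreground
            elif p:
                fp += 1  # predicted foreground, truly background
            elif t:
                fn += 1  # missed foreground
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty (convention, see lead-in)
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou


# Toy masks: one overlapping pixel, one false positive, one false negative.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
dice, iou = dice_and_iou(pred, target)  # dice = 0.5, iou = 1/3
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank methods identically on a single image, though averages over a dataset can differ.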

Circularity Check

0 steps flagged

No circularity; empirical results do not reduce to inputs by construction

The paper introduces a 3-layer convolution module to predict per-pixel dilation rates, trains the network end-to-end on segmentation tasks, and reports higher accuracy plus a post-training correlation between learned rates and object sizes. No equations, uniqueness theorems, or self-citations are invoked to derive the accuracy gain or the correlation; both are presented as measured outcomes on held-out test images. The correlation is offered as supporting observation rather than a quantity that is definitionally forced by the fitting procedure, and no load-bearing claim collapses to a renamed input or self-referential ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The learned dilation rates are treated as outputs of the network rather than hand-specified constants.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 3 internal anchors

[1] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2018)

[2] Hamaguchi, R., Fujita, A., Nemoto, K., Imaizumi, T., Hikosaka, S.: Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1442–1450. IEEE (2018)

[3] Jantzen, J., Norup, J., Dounias, G., Bjerregaard, B.: Pap-smear benchmark data for pattern classification. Nature inspired Smart Information Systems (NiSIS 2005) pp. 1–9 (2005)

[4] Li, J., Chen, Y., Cai, L., Davidson, I., Ji, S.: Dense transformer networks. arXiv preprint arXiv:1705.08881 (2017)

[5] Li, Y., Zhang, X., Chen, D.: Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1091–1100 (2018)

[6] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)

[7] Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 552–568 (2018)

[8] Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters–improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4353–4361 (2017)

[9] Shi, W., Jiang, F., Zhao, D.: Single image super-resolution with dilated convolution based multi-scale information learning inception module. In: 2017 IEEE International Conference on Image Processing (ICIP). pp. 977–981. IEEE (2017)

[10] Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1451–1460. IEEE (2018)

[11] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

[12] Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 472–480 (2017)

[13] Zhang, M., Li, X., Xu, M., Li, Q.: Rbc semantic segmentation for sickle cell disease based on deformable u-net. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 695–702. Springer (2018)

[14] Zhao, J., Li, Q., Li, H., Zhang, L.: Automated segmentation of cervical nuclei in pap smear images using deformable multi-path ensemble model. arXiv preprint arXiv:1812.00527 (2018)
