ASCNet: Adaptive-Scale Convolutional Neural Networks for Multi-Scale Feature Learning
Pith reviewed 2026-05-25 01:42 UTC · model grok-4.3
The pith
ASCNet learns a dilation rate for each pixel via a 3-layer structure to fit receptive fields to objects of varying sizes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adding a 3-layer convolution structure to the network, ASCNet learns per-pixel dilation rates during training that create optimal receptive fields matched to each object's size, enabling more accurate semantic segmentation than either classic CNNs or fixed-rate dilated CNNs on the tested medical datasets.
What carries the argument
The 3-layer convolution structure that outputs per-pixel adaptive dilation rates.
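To make the idea of a per-pixel dilation rate concrete, here is a minimal sketch of a 3x3 convolution whose sampling stride is read from a rate map. It is not the paper's implementation: the function name `adaptive_dilated_conv3x3`, the single-channel input, the integer rates, and the zero padding are all assumptions made for illustration. The rate map is passed in directly rather than predicted by a 3-layer module. A trainable version would presumably need fractional rates with interpolation so that gradients can reach the rate map, and that part is not shown.

```python
from typing import List

Grid = List[List[float]]


def adaptive_dilated_conv3x3(image: Grid, kernel: Grid, rates: List[List[int]]) -> Grid:
    """Apply a 3x3 kernel whose dilation rate is chosen per output pixel.

    Illustrative assumptions (not from the paper): one channel,
    integer rates >= 1, and zero padding outside the image.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            r = rates[y][x]  # dilation rate for this output pixel
            acc = 0.0
            for ky in (-1, 0, 1):
                for kx in (-1, 0, 1):
                    sy, sx = y + ky * r, x + kx * r  # taps sit r pixels apart
                    if 0 <= sy < height and 0 <= sx < width:
                        acc += kernel[ky + 1][kx + 1] * image[sy][sx]
            out[y][x] = acc
    return out


# Toy input: a single bright pixel in the top-left corner of a 5x5 image.
image = [[0.0] * 5 for _ in range(5)]
image[0][0] = 1.0
ones_kernel = [[1.0] * 3 for _ in range(3)]

rates_fixed = [[1] * 5 for _ in range(5)]
rates_adaptive = [row[:] for row in rates_fixed]
rates_adaptive[2][2] = 2  # only the centre pixel looks further out

fixed = adaptive_dilated_conv3x3(image, ones_kernel, rates_fixed)
adaptive = adaptive_dilated_conv3x3(image, ones_kernel, rates_adaptive)
```

With rate 1 the centre pixel cannot see the corner, so `fixed[2][2]` is 0.0. With rate 2 its top-left tap lands on the corner, so `adaptive[2][2]` is 1.0.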
If this is right
- ASCNet achieves the highest segmentation accuracy on the Herlev and SCD RBC datasets.
- The automatically generated dilation rates increase with object size.
- The approach extracts multi-scale information without fixed rates or extra computational cost from kernel expansion.
- Pixel-level rates avoid the information loss that comes from maximum pooling.
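The cost argument in the list above rests on standard dilated-convolution arithmetic: a kernel with k taps per side and dilation rate r spans k + (k - 1)(r - 1) pixels, while its weight count stays at k * k. The short sketch below works this out for a 3x3 kernel. The helper name `effective_kernel_span` is hypothetical, and the rates shown are arbitrary examples rather than values reported in the paper.

```python
def effective_kernel_span(kernel_size: int, rate: int) -> int:
    """Pixels spanned along one axis by a kernel dilated with the given rate."""
    if kernel_size < 1 or rate < 1:
        raise ValueError("kernel_size and rate must be positive integers")
    return kernel_size + (kernel_size - 1) * (rate - 1)


for rate in (1, 2, 4):
    span = effective_kernel_span(3, rate)
    # A dilated 3x3 kernel always has 9 weights; a dense kernel
    # covering the same span would need span * span weights.
    print(f"rate={rate}: span={span}x{span}, dilated weights=9, dense weights={span * span}")
```

At rate 4 the 3x3 kernel spans 9x9 pixels with 9 weights, where a dense kernel of the same span would need 81.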
Where Pith is reading between the lines
- The same per-pixel adaptation idea could be tested on natural-image datasets that contain large scale variation.
- If the learned rates prove stable across domains, the method might reduce the need for explicit multi-scale modules such as feature pyramids.
- One could measure whether the 3-layer module adds measurable latency at inference time on the same hardware used for the original experiments.
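The latency question in the last item could be approached with a small timing harness such as the sketch below. The function name `median_latency_ms` and the two placeholder workloads are hypothetical. In a real test the callables would run the forward pass of the baseline and of ASCNet on the same input. On a GPU the callable would also need to wait for the device to finish before returning, which is framework-specific and not shown.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], warmup: int = 3, repeats: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Placeholder workloads standing in for two model forward passes.
def baseline_forward() -> int:
    return sum(range(10_000))


def adaptive_forward() -> int:
    return sum(range(20_000))


baseline_ms = median_latency_ms(baseline_forward)
adaptive_ms = median_latency_ms(adaptive_forward)
print(f"baseline: {baseline_ms:.3f} ms, adaptive: {adaptive_ms:.3f} ms")
```

The median is used rather than the mean so that a single slow run, for example one interrupted by the scheduler, does not dominate the estimate.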
Load-bearing premise
That the 3-layer structure can learn per-pixel dilation rates that generalize beyond the training set, and that the positive correlation with object size shows the rates are effective rather than an artifact.
What would settle it
If additional datasets show no accuracy gain over fixed-rate dilated networks or no positive correlation between learned rates and measured object sizes, the central claim would be challenged.
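The correlation half of this test can be sketched as a plain Pearson coefficient between per-object size and the mean learned rate inside each object. The abstract does not say which correlation statistic the authors used, so Pearson is an assumption here. The numbers below are invented toy values chosen only to exercise the function, not results from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values (hypothetical): object areas in pixels and the mean
# dilation rate the network assigned inside each object.
object_areas = [120.0, 340.0, 560.0, 900.0, 1500.0]
mean_rates = [1.2, 1.9, 2.4, 3.1, 4.6]

r = pearson_r(object_areas, mean_rates)
print(f"Pearson r = {r:.3f}")
```

A value near zero or below zero on new datasets would be the outcome that challenges the central claim.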
Original abstract
Extracting multi-scale information is key to semantic segmentation. However, the classic convolutional neural networks (CNNs) encounter difficulties in achieving multi-scale information extraction: expanding convolutional kernel incurs the high computational cost and using maximum pooling sacrifices image information. The recently developed dilated convolution solves these problems, but with the limitation that the dilation rates are fixed and therefore the receptive field cannot fit for all objects with different sizes in the image. We propose an adaptive-scale convolutional neural network (ASCNet), which introduces a 3-layer convolution structure in the end-to-end training, to adaptively learn an appropriate dilation rate for each pixel in the image. Such pixel-level dilation rates produce optimal receptive fields so that the information of objects with different sizes can be extracted at the corresponding scale. We compare the segmentation results using the classic CNN, the dilated CNN and the proposed ASCNet on two types of medical images (The Herlev dataset and SCD RBC dataset). The experimental results show that ASCNet achieves the highest accuracy. Moreover, the automatically generated dilation rates are positively correlated to the sizes of the objects, confirming the effectiveness of the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ASCNet, which augments standard CNNs for semantic segmentation with a 3-layer convolution module that learns per-pixel dilation rates in an end-to-end fashion. This is intended to produce object-size-appropriate receptive fields without the cost of large kernels or the information loss of pooling. Experiments on the Herlev and SCD RBC medical-image datasets are reported to show that ASCNet attains the highest accuracy among compared methods (classic CNN, fixed-dilation CNN) and that the learned dilation rates correlate positively with object size, which the authors interpret as confirmation of the method's effectiveness.
Significance. If the per-pixel adaptation demonstrably improves segmentation beyond fixed-dilation baselines on varied object scales, the approach would offer a practical route to multi-scale feature extraction in domains such as medical imaging. The absence of reported quantitative metrics, ablations, or statistical tests in the provided abstract, however, leaves the magnitude and robustness of any gain unclear.
Major comments (2)
- [Abstract] The claim that 'ASCNet achieves the highest accuracy' is presented without any numerical values, standard deviations, or statistical tests, preventing assessment of whether the improvement is practically meaningful or merely within noise.
- [Abstract] The assertion that the positive correlation between learned dilation rates and object sizes 'confirm[s] the effectiveness of the proposed method' is load-bearing for the central contribution, yet no ablation (fixed-rate baseline, shuffled-size control, or out-of-distribution size test) is described; without such controls the correlation could arise from the network regressing to dataset-wide scale statistics rather than from useful per-pixel adaptation at test time.
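One possible reading of the shuffled-size control named in the second comment is a permutation test: shuffle the object sizes against the learned rates many times and ask how often the shuffled correlation matches or exceeds the observed one. The sketch below takes that reading. The function name `shuffled_size_p_value`, the one-sided test, and the toy numbers are all assumptions. Such a test only checks whether the correlation exceeds what random pairing would give. It does not by itself rule out the network regressing to dataset-wide scale statistics.

```python
import math
import random
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def shuffled_size_p_value(
    rates: Sequence[float],
    sizes: Sequence[float],
    n_shuffles: int = 2000,
    seed: int = 0,
) -> float:
    """One-sided permutation p-value for a positive rate/size correlation."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = pearson_r(rates, sizes)
    shuffled = list(sizes)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break the pairing between rates and sizes
        if pearson_r(rates, shuffled) >= observed:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Toy values (hypothetical), not results from the paper.
mean_rates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
object_sizes = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]

p_value = shuffled_size_p_value(mean_rates, object_sizes)
print(f"permutation p-value = {p_value:.4f}")
```

A small p-value says the pairing between rates and sizes is unlikely to be chance, which is a weaker statement than the rates being useful for segmentation.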
Simulated Author's Rebuttal
We thank the referee for these focused comments on the abstract. We will revise the abstract to include quantitative results from the experiments. For the correlation claim, we note that the existing fixed-dilation baseline already provides relevant evidence, though we acknowledge the value of additional controls.
- Referee: [Abstract] The claim that 'ASCNet achieves the highest accuracy' is presented without any numerical values, standard deviations, or statistical tests, preventing assessment of whether the improvement is practically meaningful or merely within noise.
  Authors: We agree that the abstract would benefit from quantitative support. The revised abstract will report the specific segmentation accuracy metrics (e.g., Dice or IoU scores) for ASCNet versus the classic CNN and fixed-dilation CNN on both the Herlev and SCD RBC datasets. Where multiple runs exist in the full experiments, standard deviations will be added.
  Revision: yes
- Referee: [Abstract] The assertion that the positive correlation between learned dilation rates and object sizes 'confirm[s] the effectiveness of the proposed method' is load-bearing for the central contribution, yet no ablation (fixed-rate baseline, shuffled-size control, or out-of-distribution size test) is described; without such controls the correlation could arise from the network regressing to dataset-wide scale statistics rather than from useful per-pixel adaptation at test time.
  Authors: The manuscript already compares ASCNet against a fixed-dilation CNN, which serves as the fixed-rate baseline and demonstrates performance gains from adaptation. The positive correlation is shown in the results section as corroborating evidence. Shuffled-size controls and out-of-distribution size tests were not performed; the current fixed-baseline comparison and end-to-end training provide the primary support that adaptation is task-relevant rather than purely statistical. We will clarify this distinction in the revised text.
  Revision: partial
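The Dice and IoU scores mentioned in the first response are standard overlap metrics for segmentation masks. A minimal sketch for binary masks follows. The function name `dice_and_iou`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are illustrative assumptions, and the masks are toy inputs rather than data from either dataset.

```python
from typing import List, Tuple

Mask = List[List[int]]


def dice_and_iou(pred: Mask, target: Mask) -> Tuple[float, float]:
    """Dice coefficient and intersection-over-union for two binary masks."""
    intersection = pred_count = target_count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_count += 1 if p else 0
            target_count += 1 if t else 0
    union = pred_count + target_count - intersection
    if union == 0:
        # Both masks are empty; treating this as a perfect match is a convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_count + target_count)
    iou = intersection / union
    return dice, iou


# Toy masks: the prediction covers two pixels, one of which is correct.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]

dice, iou = dice_and_iou(pred, target)  # dice = 2/3, iou = 1/2
```

Dice is never lower than IoU on the same pair of masks, so reporting either one consistently across the three networks is enough to rank them, while standard deviations over runs address the noise concern.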
Circularity Check
No circularity; empirical results do not reduce to inputs by construction
The paper introduces a 3-layer convolution module to predict per-pixel dilation rates, trains the network end-to-end on segmentation tasks, and reports higher accuracy plus a post-training correlation between learned rates and object sizes. No equations, uniqueness theorems, or self-citations are invoked to derive the accuracy gain or the correlation; both are presented as measured outcomes on held-out test images. The correlation is offered as supporting observation rather than a quantity that is definitionally forced by the fitting procedure, and no load-bearing claim collapses to a renamed input or self-referential ansatz.