arxiv: 1807.03247 · v2 · pith:4SA7SRYEnew · submitted 2018-07-09 · 💻 cs.CV · cs.LG · stat.ML

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

classification 💻 cs.CV cs.LG stat.ML
keywords coordconv convolution networks problem convolutional coordinate transform appropriate
Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either complete translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem with perfect generalization and 150 times faster with 10 to 100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST showed 24% better IOU when using CoordConv, and in the RL domain agents playing Atari games benefit significantly from the use of CoordConv layers.
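
The idea of extra coordinate channels can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`coord_channels`, `add_coords`, `one_hot_pixel`), the nested-list image layout, and the scaling of coordinates to [-1, 1] are assumptions made here for illustration. The sketch only builds the coordinate channels and the one-hot target of the coordinate transform problem; the convolution that would consume the augmented input is left out.

```python
def coord_channels(height, width):
    """Return two H x W channels holding row and column coordinates.

    Assumption: coordinates are linearly scaled to [-1, 1].
    """
    def scale(k, n):
        # Map index k in [0, n-1] to [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if n == 1 else 2.0 * k / (n - 1) - 1.0

    # i_channel is constant along each row, j_channel along each column.
    i_channel = [[scale(r, height) for _ in range(width)] for r in range(height)]
    j_channel = [[scale(c, width) for c in range(width)] for _ in range(height)]
    return i_channel, j_channel


def add_coords(image):
    """Append coordinate channels to an image given as a list of H x W channels.

    Returns a new list of channels; the input list is not modified.
    """
    height, width = len(image[0]), len(image[0][0])
    i_channel, j_channel = coord_channels(height, width)
    return image + [i_channel, j_channel]


def one_hot_pixel(x, y, height, width):
    """Target of the coordinate transform task: a grid with a single 1 at column x, row y."""
    return [[1.0 if (r == y and c == x) else 0.0 for c in range(width)]
            for r in range(height)]


if __name__ == "__main__":
    image = [[[0.0] * 3 for _ in range(2)]]  # one channel, 2 rows, 3 columns
    augmented = add_coords(image)
    print(len(augmented))            # number of channels after augmentation
    print(one_hot_pixel(2, 1, 2, 3))
```

A convolution applied to the augmented input sees, at every position, both the original features and the position itself. If its weights on the two coordinate channels are zero, it behaves like an ordinary translation-invariant convolution, which is consistent with the abstract's claim that the network can learn either invariance or translation dependence.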

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently

    cs.LG 2026-05 unverdicted novelty 7.0

    Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.

  2. VitaminP: cross-modal learning enables whole-cell segmentation from routine histology

    cs.CV 2026-04 unverdicted novelty 7.0

    VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.

  3. Flexible SVBRDF Capture with a Multi-Image Deep Network

    cs.GR 2019-06 unverdicted novelty 6.0

    A deep network with an order-independent fusing layer estimates SVBRDF from variable numbers of uncalibrated handheld flash photos and improves with more inputs.

  4. Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic Interactions

    cs.CV 2019-06 unverdicted novelty 6.0

    A grid-based convolutional architecture fuses semantic maps and 3D perceptions to model driving interactions and predict future agent states, evaluated on a new industry-grade dataset.