Unifying Convolution and Attention via Convolutional Nearest Neighbors
Convolutional Neural Networks and Vision Transformers are the two dominant architectural families in computer vision, defined by spatially local convolution and global self-attention respectively. Despite their apparent differences, we show that both operations are special cases of a single $k$-nearest neighbor aggregation framework: convolution selects neighbors by spatial proximity while attention selects by feature similarity, placing them at two ends of a shared operational spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that exactly recovers standard and depthwise convolution, self-attention, and sparse attention variants including KVT-attention as special cases, and exposes the design space of neighbor-selection strategies between them through configurable similarity functions, positional encodings, and aggregation kernels. We validate ConvNN on ImageNet-1K classification across two complementary architectures: a hybrid branching layer in ResNet-50 that combines local and global feature learning, improving top-1 accuracy by 3.0% over the ResNet-50 baseline, and ConvNN-attention in ViT-Base that achieves 81.64% top-1 accuracy, surpassing standard multi-head self-attention by 0.7%. Together, these results demonstrate that ConvNN provides a principled foundation for designing operations that bridge convolutional and attention-based computation.
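The central idea of the abstract, that convolution and attention differ mainly in how each token picks its neighbors, can be illustrated with a small NumPy sketch. This is not the ConvNN implementation from the paper: the function names (`knn_aggregate`, `spatial_scores`, `feature_scores`), the 1D token layout, the cosine similarity, and the uniform-mean aggregation are all assumptions chosen for illustration. The sketch also does not reproduce the exact recovery of convolution or softmax attention that the paper claims; it only shows the shared select-then-aggregate pattern.

```python
import numpy as np

def knn_aggregate(x, scores, k, weights=None):
    """For each token i, select the k highest-scoring tokens and aggregate them.

    x:       (n, d) token features
    scores:  (n, n) array; scores[i, j] is how strongly token i prefers token j
    k:       number of neighbors per token
    weights: optional (k,) aggregation weights; defaults to a uniform mean
    """
    # Indices of the k largest scores per row, best first (stable on ties).
    idx = np.argsort(-scores, axis=1, kind="stable")[:, :k]  # (n, k)
    neighbors = x[idx]                                        # (n, k, d)
    if weights is None:
        weights = np.full(k, 1.0 / k)
    # Weighted sum over the k selected neighbors.
    return np.einsum("k,nkd->nd", weights, neighbors), idx

def spatial_scores(n):
    """Convolution-like selection: closer positions score higher."""
    pos = np.arange(n)
    return -np.abs(pos[:, None] - pos[None, :]).astype(float)

def feature_scores(x):
    """Attention-like selection: cosine similarity between token features."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return xn @ xn.T

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))  # 8 tokens on a 1D grid, 4 features each

# Spatial neighbors: an interior token selects itself and its two adjacent tokens.
out_spatial, idx_spatial = knn_aggregate(x, spatial_scores(8), k=3)

# Feature neighbors: each token selects the tokens most similar to it, anywhere.
out_feature, idx_feature = knn_aggregate(x, feature_scores(x), k=3)
```

With spatial scores and `k=3`, an interior token aggregates the window around it, so the uniform mean behaves like a width-3 box filter there; boundary tokens instead select their three nearest in-range positions, which differs from a zero-padded convolution. With feature scores, the selected indices can lie anywhere in the sequence, which is the sparse-attention end of the spectrum the abstract describes.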
Forward citations
Cited by 1 Pith paper
- Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions
Multiprobe grid ANN maintains roughly constant d-scaling on GloVe while graph/tree/partitioning methods degrade, with near-linear N scaling and lower indexing cost.