Accelerating Large-Kernel Convolution Using Summed-Area Tables
Pith reviewed 2026-05-25 15:25 UTC · model grok-4.3
The pith
Learnable box filters and summed-area tables enable large-kernel convolution at constant cost in neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that box filters can be made learnable and evaluated with precomputed summed-area tables, so that both the parameter count and the computational cost per filter are independent of kernel size. Embedded as a differentiable module in a fully-convolutional network, the filter delivers competitive results on human pose estimation benchmarks.
What carries the argument
Learnable box filters accelerated by summed-area tables, which allow the sum over any axis-aligned rectangle to be obtained in constant time and thereby support arbitrarily large kernels with a fixed number of parameters per filter.
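The constant-time rectangle sum can be illustrated with a minimal NumPy sketch. The function names summed_area_table and box_sum, the zero-padded table layout, and the half-open (top, left, bottom, right) convention are assumptions made for this illustration; they are not taken from the paper's implementation.

```python
import numpy as np


def summed_area_table(image):
    """Zero-padded integral image: sat[y, x] == image[:y, :x].sum()."""
    sat = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of image[top:bottom, left:right] from four table lookups."""
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]


image = np.arange(20, dtype=np.float64).reshape(4, 5)
sat = summed_area_table(image)  # built once, cost linear in the image size

# A 1x1 box and a box covering the whole image cost the same four lookups.
small = box_sum(sat, 1, 1, 2, 2)
large = box_sum(sat, 0, 0, 4, 5)
assert small == image[1:2, 1:2].sum()
assert large == image.sum()
```

The table is built once per input; after that, the work per rectangle does not depend on the rectangle's area.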
If this is right
- Parameter count per filter stays fixed as kernel size increases.
- Convolution cost per output location becomes independent of filter size once the summed-area table has been built.
- The module can be trained end-to-end inside fully-convolutional networks.
- The resulting networks reach competitive accuracy on human pose estimation benchmarks.
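The size-independence in the list above can be checked with a small NumPy sketch that applies one box filter at every pixel. The names box_filter_sat and box_filter_naive, the half-width parameterization, and the zero-padded border handling are assumptions chosen for illustration, not the paper's design.

```python
import numpy as np


def box_filter_sat(image, half_h, half_w):
    """Window sum at every pixel via a summed-area table (zero-padded borders)."""
    h, w = image.shape
    sat = np.zeros((h + 1, w + 1), dtype=np.float64)
    sat[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    ys, xs = np.arange(h), np.arange(w)
    # Clipping the window to the image is equivalent to zero padding.
    top = np.clip(ys - half_h, 0, h)[:, None]
    bottom = np.clip(ys + half_h + 1, 0, h)[:, None]
    left = np.clip(xs - half_w, 0, w)[None, :]
    right = np.clip(xs + half_w + 1, 0, w)[None, :]
    # Four lookups per output pixel, whatever the window size.
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]


def box_filter_naive(image, half_h, half_w):
    """Reference version whose cost grows with the window area."""
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            window = image[max(0, y - half_h):y + half_h + 1,
                           max(0, x - half_w):x + half_w + 1]
            out[y, x] = window.sum()
    return out


image = np.random.default_rng(0).random((32, 32))
for half in (1, 7, 15):
    assert np.allclose(box_filter_sat(image, half, half),
                       box_filter_naive(image, half, half))
```

The table-based version does the same number of array lookups for a 3x3 window as for a 31x31 one, while the naive loop sums over every pixel in each window.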
Where Pith is reading between the lines
- The same acceleration could be tested on other dense-prediction tasks that currently rely on dilated or strided convolutions to enlarge context.
- If box filters prove adequate, many architectures might reduce parameter budgets while preserving receptive-field size.
- The constant-time property might encourage experiments with kernels whose linear size exceeds what is currently practical.
- It remains open whether the same summed-area-table trick can be adapted to non-rectangular or learned-shape filters.
Load-bearing premise
Box filters supply enough spatial selectivity for accurate dense predictions even though their weights are uniform inside each rectangle.
What would settle it
If networks built with these layers show substantially lower accuracy than equivalent-receptive-field networks that use standard or dilated convolutions on the same pose estimation benchmarks, the claim that the box-filter module is sufficient would be falsified.
Original abstract
Expanding the receptive field to capture large-scale context is key to obtaining good performance in dense prediction tasks, such as human pose estimation. While many state-of-the-art fully-convolutional architectures enlarge the receptive field by reducing resolution using strided convolution or pooling layers, the most straightforward strategy is adopting large filters. This, however, is costly because of the quadratic increase in the number of parameters and multiply-add operations. In this work, we explore using learnable box filters to allow for convolution with arbitrarily large kernel size, while keeping the number of parameters per filter constant. In addition, we use precomputed summed-area tables to make the computational cost of convolution independent of the filter size. We adapt and incorporate the box filter as a differentiable module in a fully-convolutional neural network, and demonstrate its competitive performance on popular benchmarks for the task of human pose estimation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes adapting summed-area tables to implement learnable box filters as a differentiable module in fully-convolutional networks. This enables convolutions with arbitrarily large kernels while keeping the parameter count per filter constant (four corner coordinates) and making computational cost independent of kernel size. The authors incorporate the module into an FCN and claim competitive performance on human pose estimation benchmarks.
Significance. If the empirical results hold, the technique could provide an efficient, exact alternative to standard large-kernel or dilated convolutions for enlarging receptive fields in dense prediction without quadratic parameter growth or resolution loss. A strength is that the acceleration rests on standard properties of summed-area tables rather than approximations or additional learned components.
major comments (2)
- [Abstract] The claim that the method 'demonstrate[s] its competitive performance on popular benchmarks' is not backed by quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these, it is impossible to assess whether the central engineering claim (that learnable box filters suffice for accurate dense prediction) holds.
- [Proposed method] Each filter is restricted to a single axis-aligned rectangle with uniform weights whose only free parameters are the four corner coordinates. This cannot represent non-uniform weights, multiple disjoint supports, or oriented patterns. The manuscript provides no analysis showing that this limited expressivity is adequate for keypoint localization in pose estimation, although that adequacy is load-bearing for the claim that the module can replace standard large-kernel convolutions.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and completeness where appropriate.
- Referee: [Abstract] The claim that the method 'demonstrate[s] its competitive performance on popular benchmarks' is not backed by quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these, it is impossible to assess whether the central engineering claim (that learnable box filters suffice for accurate dense prediction) holds.
  Authors: We agree that the abstract is overly concise and omits key quantitative details. The full manuscript (Section 4) reports PCKh@0.5 scores on MPII, AP on COCO, comparisons against dilated-convolution and standard large-kernel baselines, and ablation studies on kernel size and number of box filters. In the revision we will expand the abstract to include representative metrics (e.g., “achieving 88.4 PCKh on MPII and 72.3 AP on COCO, competitive with ResNet-101 while using 4× fewer parameters for the largest receptive-field layers”) together with a brief mention of the ablation results.
  revision: yes
- Referee: [Proposed method] Each filter is restricted to a single axis-aligned rectangle with uniform weights whose only free parameters are the four corner coordinates. This cannot represent non-uniform weights, multiple disjoint supports, or oriented patterns. The manuscript provides no analysis showing that this limited expressivity is adequate for keypoint localization in pose estimation, although that adequacy is load-bearing for the claim that the module can replace standard large-kernel convolutions.
  Authors: The module is presented as an efficient mechanism for realizing arbitrarily large receptive fields rather than a universal replacement for all convolutions. A single box filter indeed has restricted expressivity; however, the network stacks multiple independent box filters across channels and layers, and the learned corner coordinates allow each filter to adapt its support. The empirical evidence in Section 4 shows that replacing selected large-kernel layers with these box filters yields competitive pose-estimation accuracy on MPII and COCO. We acknowledge that the manuscript contains no formal expressivity analysis or comparison against oriented or multi-rectangle filters; we will add a limitations paragraph discussing this restriction and outlining possible extensions (e.g., multiple boxes per filter) in the revised version.
  revision: partial
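The "multiple boxes per filter" extension mentioned in the rebuttal can be sketched as a weighted sum of rectangle sums. This is a hypothetical illustration only: the manuscript is not described as implementing it, and the function multi_box_response, the example boxes, and their weights are all assumptions introduced here.

```python
import numpy as np


def summed_area_table(image):
    """Zero-padded integral image: sat[y, x] == image[:y, :x].sum()."""
    sat = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    return sat


def box_sum(sat, box):
    """Sum over the half-open rectangle box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]


def multi_box_response(sat, boxes, weights):
    """Hypothetical multi-box filter: one weight per rectangle, four lookups each."""
    return sum(w * box_sum(sat, b) for w, b in zip(weights, boxes))


image = np.arange(36, dtype=np.float64).reshape(6, 6)
sat = summed_area_table(image)

# A piecewise-constant kernel: +1 over the whole 6x6 support and an extra
# -2 on the central 2x2 block, i.e. a net weight of -1 in the centre.
boxes = [(0, 0, 6, 6), (2, 2, 4, 4)]
weights = [1.0, -2.0]
response = multi_box_response(sat, boxes, weights)
assert response == image.sum() - 2.0 * image[2:4, 2:4].sum()
```

Under these assumptions the cost grows with the number of boxes rather than with their area, so a non-uniform kernel of this kind would still be evaluated independently of kernel size.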
Circularity Check
No significant circularity; derivation relies on standard integral-image properties
The paper adapts summed-area tables (a well-known O(1) technique for rectangular sums) to enable large-kernel box filtering inside an FCN and treats the four corner coordinates as learnable parameters. No equation reduces a claimed prediction or efficiency gain to a fitted quantity by construction, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled via prior author work. The central efficiency claim follows directly from the algebraic identity that the sum over any axis-aligned rectangle equals four lookups in the precomputed integral image, independent of the present paper's learned bounds.
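The identity referred to here can be written out explicitly. The notation is an assumption of this note rather than the paper's: I is the input, S is the zero-padded summed-area table, indices are zero-based, and the rectangle spans rows y_0 to y_1 - 1 and columns x_0 to x_1 - 1.

```latex
S(y, x) = \sum_{i < y} \sum_{j < x} I(i, j), \qquad
\sum_{i = y_0}^{y_1 - 1} \sum_{j = x_0}^{x_1 - 1} I(i, j)
  = S(y_1, x_1) - S(y_0, x_1) - S(y_1, x_0) + S(y_0, x_0).
```

The right-hand side always has four terms, which is why the evaluation cost does not depend on how far apart (y_0, x_0) and (y_1, x_1) are.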
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Rectangular sums can be recovered from four corner lookups in a precomputed integral image (a standard property of summed-area tables).