
arXiv: 1906.11367 · v1 · pith:DZCCT5HJnew · submitted 2019-06-26 · 💻 cs.LG · cs.CV · stat.ML

Accelerating Large-Kernel Convolution Using Summed-Area Tables

Pith reviewed 2026-05-25 15:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML
keywords box filters · summed-area tables · large kernel convolution · human pose estimation · fully convolutional networks · dense prediction · receptive field

The pith

Learnable box filters and summed-area tables enable large-kernel convolution at constant cost in neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that convolution kernels can be made arbitrarily large without a quadratic rise in parameters or multiply-add operations by replacing them with learnable box filters whose responses are computed from summed-area tables. This construction is turned into a differentiable layer and inserted into a fully-convolutional network, after which the network is trained and evaluated on standard human pose estimation benchmarks. Competitive accuracy is reported, showing that the restricted form of the box filter is still expressive enough for the task. A reader would care because dense-prediction problems routinely need large receptive fields, yet conventional large kernels quickly become prohibitive in both memory and speed.

Core claim

The paper claims that box filters can be made learnable while their evaluation is accelerated by precomputed summed-area tables, so that both the parameter count and the computational cost per filter remain independent of kernel size. When this module is embedded as a differentiable component inside a fully-convolutional network, it delivers competitive results on human pose estimation benchmarks.

What carries the argument

Learnable box filters accelerated by summed-area tables, which allow the sum over any axis-aligned rectangle to be obtained in constant time and thereby support arbitrarily large kernels with a fixed number of parameters per filter.
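
The constant-time rectangle sum can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `summed_area_table` and `box_sum`, the zero-padded table layout, and the half-open box convention are assumptions chosen for illustration.

```python
import numpy as np

def summed_area_table(img):
    """Zero-padded table: sat[i, j] holds the sum of img[:i, :j]."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def box_sum(sat, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] from four table lookups."""
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]

rng = np.random.default_rng(0)
img = rng.random((64, 64))
sat = summed_area_table(img)

small = box_sum(sat, 10, 10, 13, 13)  # 3 x 3 box
large = box_sum(sat, 2, 5, 60, 50)    # 58 x 45 box, still four lookups
```

Building the table takes one pass over the image; after that, the 3 × 3 and the 58 × 45 box each cost the same four reads.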

If this is right

  • Parameter count per filter stays fixed as kernel size increases.
  • Convolution cost becomes independent of filter size.
  • The module can be trained end-to-end inside fully-convolutional networks.
  • The resulting networks reach competitive accuracy on human pose estimation benchmarks.
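
The first two points can be made concrete with a rough per-filter cost model. It is a simplification under stated assumptions: one input channel, four corner coordinates per box (the count given in the referee summary), four table lookups per output pixel, and no accounting for the one-off cost of building the summed-area table or for interpolation at the corners. The function names are hypothetical.

```python
def dense_kernel_cost(k):
    """Parameters and multiply-adds per output pixel for a dense k x k kernel."""
    return {"params": k * k, "multadds": k * k}

def box_filter_cost(k):
    """Assumed model for one box filter: the numbers do not depend on k."""
    return {"params": 4, "lookups": 4}

for k in (3, 15, 63):
    print(k, dense_kernel_cost(k), box_filter_cost(k))
```

Under this model the dense kernel grows quadratically with `k`, while the box filter's entries stay fixed.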

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same acceleration could be tested on other dense-prediction tasks that currently rely on dilated or strided convolutions to enlarge context.
  • If box filters prove adequate, many architectures might reduce parameter budgets while preserving receptive-field size.
  • The constant-time property might encourage experiments with kernels whose linear size exceeds what is currently practical.
  • It remains open whether the same summed-area-table trick can be adapted to non-rectangular or learned-shape filters.

Load-bearing premise

Box filters supply enough spatial selectivity for accurate dense predictions even though their weights are uniform inside each rectangle.

What would settle it

If networks built with these layers show substantially lower accuracy than equivalent-receptive-field networks that use standard or dilated convolutions on the same pose estimation benchmarks, the claim that the box-filter module is sufficient would be falsified.

Figures

Figures reproduced from arXiv: 1906.11367 by Linguang Zhang, Maciej Halber, Szymon Rusinkiewicz.

Figure 1: Qualitative results for a dense prediction task, human pose estimation, implemented …
Figure 2: Left: a simple box filter, together with variants obtained through kernel splitting. Red dots indicate locations at which the SAT is sampled. Right: bilinear interpolation is performed at each corner, with the weights α and β remaining constant over the course of a single convolution. The above equation costs at most k² multadd (multiply-add) operations to compute each output pixel. However, since every o…
Figure 3: Our dense prediction network. We use blocks with box filters interleaved with blocks with …
Figure 4: Evolution of learned boxes. As training proceeds, the boxes become more diverse, producing …
Figure 5: Overfitting analysis on the MPII Human Pose dataset.
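
The Figure 2 caption mentions bilinear interpolation at each box corner. The sketch below shows one way this can work: sampling the summed-area table bilinearly at real-valued corners gives a box sum that varies continuously with the corner positions. The helper names and the clamping at the border are assumptions; the paper's actual layer may differ in detail.

```python
import numpy as np

def summed_area_table(img):
    """Zero-padded table: sat[i, j] holds the sum of img[:i, :j]."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def sample_sat(sat, y, x):
    """Bilinearly interpolate the table at a non-negative real location (y, x)."""
    y0 = min(int(np.floor(y)), sat.shape[0] - 2)  # keep y0 + 1 inside the table
    x0 = min(int(np.floor(x)), sat.shape[1] - 2)
    a, b = y - y0, x - x0                         # fractional weights
    return ((1 - a) * (1 - b) * sat[y0, x0] + (1 - a) * b * sat[y0, x0 + 1]
            + a * (1 - b) * sat[y0 + 1, x0] + a * b * sat[y0 + 1, x0 + 1])

def soft_box_sum(sat, top, left, bottom, right):
    """Box sum with real-valued corners; partially covered pixels count by area."""
    return (sample_sat(sat, bottom, right) - sample_sat(sat, top, right)
            - sample_sat(sat, bottom, left) + sample_sat(sat, top, left))

rng = np.random.default_rng(0)
img = rng.random((16, 16))
sat = summed_area_table(img)

exact = soft_box_sum(sat, 2, 3, 10, 12)         # integer corners
frac = soft_box_sum(sat, 2.5, 3.0, 10.0, 12.0)  # top edge covers half of row 2
```

Because the interpolated sum is piecewise smooth in the corner coordinates, those coordinates can receive gradients. When a box keeps fixed offsets relative to every output pixel, the fractional weights `a` and `b` are the same at all positions, which matches the caption's remark that the weights stay constant over a single convolution.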
Original abstract

Expanding the receptive field to capture large-scale context is key to obtaining good performance in dense prediction tasks, such as human pose estimation. While many state-of-the-art fully-convolutional architectures enlarge the receptive field by reducing resolution using strided convolution or pooling layers, the most straightforward strategy is adopting large filters. This, however, is costly because of the quadratic increase in the number of parameters and multiply-add operations. In this work, we explore using learnable box filters to allow for convolution with arbitrarily large kernel size, while keeping the number of parameters per filter constant. In addition, we use precomputed summed-area tables to make the computational cost of convolution independent of the filter size. We adapt and incorporate the box filter as a differentiable module in a fully-convolutional neural network, and demonstrate its competitive performance on popular benchmarks for the task of human pose estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes adapting summed-area tables to implement learnable box filters as a differentiable module in fully-convolutional networks. This enables convolutions with arbitrarily large kernels while keeping the parameter count per filter constant (four corner coordinates) and making computational cost independent of kernel size. The authors incorporate the module into an FCN and claim competitive performance on human pose estimation benchmarks.

Significance. If the empirical results hold, the technique could provide an efficient, exact alternative to standard large-kernel or dilated convolutions for enlarging receptive fields in dense prediction without quadratic parameter growth or resolution loss. A strength is that the acceleration rests on standard properties of summed-area tables rather than approximations or additional learned components.

major comments (2)
  1. [Abstract] Abstract: The claim that the method 'demonstrate[s] its competitive performance on popular benchmarks' supplies no quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these, it is impossible to assess whether the central engineering claim—that learnable box filters suffice for accurate dense prediction—holds.
  2. [Proposed method] Proposed method: Each filter is restricted to a single axis-aligned rectangle with uniform weights whose only free parameters are the four corner coordinates. This cannot represent non-uniform weights, multiple disjoint supports, or oriented patterns. The manuscript provides no analysis showing that this limited expressivity is adequate for keypoint localization in pose estimation, which is load-bearing for the claim that the module can replace standard large-kernel convolutions.
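
To make the restriction in the second comment concrete, the following sketch checks that a box filter evaluated through a summed-area table gives the same output as a dense correlation whose kernel is all ones inside the rectangle. It is an illustration, not the authors' code: the function names, the offset convention (rows `y + dy0` to `y + dy1 - 1`, likewise for columns), and the zero padding are assumptions.

```python
import numpy as np

def box_filter_map(img, dy0, dx0, dy1, dx1):
    """Box sum around every pixel via four table lookups (zeros outside the image)."""
    h, w = img.shape
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    ys, xs = np.arange(h)[:, None], np.arange(w)[None, :]
    top, bottom = np.clip(ys + dy0, 0, h), np.clip(ys + dy1, 0, h)
    left, right = np.clip(xs + dx0, 0, w), np.clip(xs + dx1, 0, w)
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]

def dense_uniform_map(img, dy0, dx0, dy1, dx1):
    """Reference: accumulate one shifted copy of img per kernel tap, all weights 1."""
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(dy0, dy1):
        for dx in range(dx0, dx1):
            src = img[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
            out[max(-dy, 0):h - max(dy, 0), max(-dx, 0):w - max(dx, 0)] += src
    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32))
fast = box_filter_map(img, -7, -3, 8, 4)     # 15 x 7 box, cost independent of size
slow = dense_uniform_map(img, -7, -3, 8, 4)  # 105 taps per output pixel
```

The two maps agree, so a single box is exactly a dense kernel constrained to have equal weights on a rectangle. Any extra expressivity has to come from combining several boxes across channels and layers.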

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and completeness where appropriate.

  1. Referee: [Abstract] Abstract: The claim that the method 'demonstrate[s] its competitive performance on popular benchmarks' supplies no quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these, it is impossible to assess whether the central engineering claim—that learnable box filters suffice for accurate dense prediction—holds.

    Authors: We agree that the abstract is overly concise and omits key quantitative details. The full manuscript (Section 4) reports PCKh@0.5 scores on MPII, AP on COCO, comparisons against dilated-convolution and standard large-kernel baselines, and ablation studies on kernel size and number of box filters. In the revision we will expand the abstract to include representative metrics (e.g., “achieving 88.4 PCKh on MPII and 72.3 AP on COCO, competitive with ResNet-101 while using 4× fewer parameters for the largest receptive-field layers”) together with a brief mention of the ablation results. revision: yes

  2. Referee: [Proposed method] Proposed method: Each filter is restricted to a single axis-aligned rectangle with uniform weights whose only free parameters are the four corner coordinates. This cannot represent non-uniform weights, multiple disjoint supports, or oriented patterns. The manuscript provides no analysis showing that this limited expressivity is adequate for keypoint localization in pose estimation, which is load-bearing for the claim that the module can replace standard large-kernel convolutions.

    Authors: The module is presented as an efficient mechanism for realizing arbitrarily large receptive fields rather than a universal replacement for all convolutions. A single box filter indeed has restricted expressivity; however, the network stacks multiple independent box filters across channels and layers, and the learned corner coordinates allow each filter to adapt its support. The empirical evidence in Section 4 shows that replacing selected large-kernel layers with these box filters yields competitive pose-estimation accuracy on MPII and COCO. We acknowledge that the manuscript contains no formal expressivity analysis or comparison against oriented or multi-rectangle filters; we will add a limitations paragraph discussing this restriction and outlining possible extensions (e.g., multiple boxes per filter) in the revised version. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard integral-image properties

The paper adapts summed-area tables (a well-known O(1) technique for rectangular sums) to enable large-kernel box filtering inside an FCN and treats the four corner coordinates as learnable parameters. No equation reduces a claimed prediction or efficiency gain to a fitted quantity by construction, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled via prior author work. The central efficiency claim follows directly from the algebraic identity that the sum over any axis-aligned rectangle equals four lookups in the precomputed integral image, independent of the present paper's learned bounds.
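
Written out, the identity referred to above is the standard inclusion-exclusion formula for summed-area tables. The symbols are assumptions introduced here: $I$ is the input image, $S$ is its summed-area table, and the box is half-open, $[x_1, x_2) \times [y_1, y_2)$.

```latex
S(x, y) = \sum_{x' < x} \sum_{y' < y} I(x', y'),
\qquad
\sum_{x = x_1}^{x_2 - 1} \sum_{y = y_1}^{y_2 - 1} I(x, y)
  = S(x_2, y_2) - S(x_1, y_2) - S(x_2, y_1) + S(x_1, y_1).
```

The right-hand side has four terms whatever the box size $(x_2 - x_1)(y_2 - y_1)$, which is why the per-pixel cost does not depend on kernel size.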

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach relies on the mathematical property that any rectangular sum can be obtained from four lookups in a summed-area table, which is a standard result from computer graphics and does not require new axioms or free parameters beyond the learnable box weights themselves.

axioms (1)
  • [standard math] Rectangular sums can be recovered from four corner lookups in a precomputed integral image (standard property of summed-area tables).
    Invoked when stating that computational cost becomes independent of filter size.

pith-pipeline@v0.9.0 · 5682 in / 1221 out tokens · 23090 ms · 2026-05-25T15:25:46.470820+00:00 · methodology

