Accelerating Large-Kernel Convolution Using Summed-Area Tables

Linguang Zhang; Maciej Halber; Szymon Rusinkiewicz

arXiv: 1906.11367 · v1 · pith:DZCCT5HJnew · submitted 2019-06-26 · 💻 cs.LG · cs.CV · stat.ML

Pith reviewed 2026-05-25 15:25 UTC · model grok-4.3

classification: cs.LG · cs.CV · stat.ML

keywords: box filters · summed-area tables · large kernel convolution · human pose estimation · fully convolutional networks · dense prediction · receptive field

The pith

Learnable box filters and summed-area tables enable large-kernel convolution at constant cost in neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that convolution kernels can be made arbitrarily large without a quadratic rise in parameters or multiply-add operations by replacing them with learnable box filters whose responses are computed from summed-area tables. This construction is turned into a differentiable layer and inserted into a fully-convolutional network, after which the network is trained and evaluated on standard human pose estimation benchmarks. Competitive accuracy is reported, showing that the restricted form of the box filter is still expressive enough for the task. A reader would care because dense-prediction problems routinely need large receptive fields, yet conventional large kernels quickly become prohibitive in both memory and speed.

Core claim

The paper claims that box filters can be made learnable while their evaluation is accelerated by precomputed summed-area tables so that both parameter count and computational cost per filter remain independent of kernel size; when this module is embedded as a differentiable component inside a fully-convolutional network it delivers competitive results on human pose estimation benchmarks.

What carries the argument

Learnable box filters accelerated by summed-area tables, which allow the sum over any axis-aligned rectangle to be obtained in constant time and thereby support arbitrarily large kernels with a fixed number of parameters per filter.
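
The constant-time rectangle sum can be illustrated with a minimal NumPy sketch. The function names `summed_area_table` and `box_sum`, the zero-padded table layout, and the half-open index convention are assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np

def summed_area_table(img):
    """Zero-padded table: sat[i, j] holds the sum of img[:i, :j]."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def box_sum(sat, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using four table lookups."""
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]

img = np.arange(36, dtype=np.float64).reshape(6, 6)
sat = summed_area_table(img)

# A 1x1 box and a 6x6 box cost the same four lookups.
small = box_sum(sat, 2, 2, 3, 3)
large = box_sum(sat, 0, 0, 6, 6)
assert small == img[2:3, 2:3].sum()
assert large == img.sum()
```

The table is built once per input channel and can then be reused for every box evaluated on that channel, which is why the per-filter cost does not grow with the rectangle.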

If this is right

Parameter count per filter stays fixed as kernel size increases.
Convolution cost becomes independent of filter size.
The module can be trained end-to-end inside fully-convolutional networks.
The resulting networks reach competitive accuracy on human pose estimation benchmarks.
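
The first two points can be made concrete with a small back-of-the-envelope comparison. This is a sketch under stated assumptions: a dense k×k kernel is charged k² parameters and k² multiply-adds per output pixel, while a box filter is charged four corner coordinates and four table lookups at integer corners. The helper names are hypothetical, and the one-off cost of building the table is left out.

```python
def dense_kernel_cost(k):
    """Parameters and multiply-adds per output pixel for a dense k x k kernel."""
    return {"params": k * k, "multadds": k * k}

def box_filter_cost(k):
    """Assumed per-filter cost of a box filter; k is accepted only to make
    the comparison explicit and does not affect the result."""
    return {"params": 4, "lookups": 4}

for k in (3, 15, 63):
    print(k, dense_kernel_cost(k), box_filter_cost(k))
```

Under these assumptions the dense cost grows quadratically with k while the box-filter entries stay fixed.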

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same acceleration could be tested on other dense-prediction tasks that currently rely on dilated or strided convolutions to enlarge context.
If box filters prove adequate, many architectures might reduce parameter budgets while preserving receptive-field size.
The constant-time property might encourage experiments with kernels whose linear size exceeds what is currently practical.
It remains open whether the same summed-area-table trick can be adapted to non-rectangular or learned-shape filters.

Load-bearing premise

Box filters supply enough spatial selectivity for accurate dense predictions even though their weights are uniform inside each rectangle.

What would settle it

If networks built with these layers show substantially lower accuracy than equivalent-receptive-field networks that use standard or dilated convolutions on the same pose estimation benchmarks, the claim that the box-filter module is sufficient would be falsified.

Figures

Figures reproduced from arXiv: 1906.11367 by Linguang Zhang, Maciej Halber, Szymon Rusinkiewicz.

**Figure 2.** Left: a simple box filter, together with variants obtained through kernel splitting. Red dots indicate locations at which the SAT is sampled. Right: bilinear interpolation is performed at each corner, with the weights α and β remaining constant over the course of a single convolution. The above equation costs at most k² multadd (multiply-add) operations to compute each output pixel. However, since every o…

**Figure 3.** Our dense prediction network. We use blocks with box filters interleaved with blocks with …

**Figure 4.** Evolution of learned boxes. As training proceeds, the boxes become more diverse, producing …

**Figure 5.** Overfitting analysis on the MPII Human Pose dataset.
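
The corner interpolation mentioned in the Figure 2 caption can be sketched as follows. This is an illustration, not the authors' implementation: the function names are hypothetical, the variables `alpha` and `beta` are assumed to play the role of the caption's α and β, and corner coordinates are assumed to lie inside the table.

```python
import numpy as np

def summed_area_table(img):
    """Zero-padded table: sat[i, j] holds the sum of img[:i, :j]."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def sample_sat(sat, y, x):
    """Bilinearly interpolate the table at a fractional location (y, x)."""
    y0 = min(int(np.floor(y)), sat.shape[0] - 2)
    x0 = min(int(np.floor(x)), sat.shape[1] - 2)
    alpha, beta = y - y0, x - x0  # fractional offsets used as weights
    return ((1 - alpha) * (1 - beta) * sat[y0, x0]
            + (1 - alpha) * beta * sat[y0, x0 + 1]
            + alpha * (1 - beta) * sat[y0 + 1, x0]
            + alpha * beta * sat[y0 + 1, x0 + 1])

def soft_box_sum(sat, top, left, bottom, right):
    """Box sum with real-valued corners: four interpolated samples."""
    return (sample_sat(sat, bottom, right) - sample_sat(sat, top, right)
            - sample_sat(sat, bottom, left) + sample_sat(sat, top, left))

img = np.ones((4, 4))
sat = summed_area_table(img)
# On an all-ones image the result equals the area of the box.
area = soft_box_sum(sat, 0.5, 0.5, 2.5, 3.0)
assert np.isclose(area, 2.0 * 2.5)
```

Because the output varies continuously with the corner coordinates, gradients with respect to those coordinates are well defined almost everywhere, which is what lets the box be learned. When a box is translated by whole pixels across the image, the fractional offsets do not change, consistent with the caption's remark that the weights stay constant during one convolution.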

Original abstract

Expanding the receptive field to capture large-scale context is key to obtaining good performance in dense prediction tasks, such as human pose estimation. While many state-of-the-art fully-convolutional architectures enlarge the receptive field by reducing resolution using strided convolution or pooling layers, the most straightforward strategy is adopting large filters. This, however, is costly because of the quadratic increase in the number of parameters and multiply-add operations. In this work, we explore using learnable box filters to allow for convolution with arbitrarily large kernel size, while keeping the number of parameters per filter constant. In addition, we use precomputed summed-area tables to make the computational cost of convolution independent of the filter size. We adapt and incorporate the box filter as a differentiable module in a fully-convolutional neural network, and demonstrate its competitive performance on popular benchmarks for the task of human pose estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows how to make large-kernel convolutions cheap and differentiable via learnable box filters plus summed-area tables, but supplies no numbers and the uniform-rectangle form may not deliver the spatial selectivity needed for pose estimation.

The main contribution is turning classic summed-area tables into a differentiable module so that box filters can replace large kernels inside an FCN while keeping parameter count and compute constant. They apply this to human pose estimation and say the results are competitive. That combination of graphics technique and end-to-end training is new relative to the cited prior work and directly targets the quadratic cost problem in dense prediction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes adapting summed-area tables to implement learnable box filters as a differentiable module in fully-convolutional networks. This enables convolutions with arbitrarily large kernels while keeping the parameter count per filter constant (four corner coordinates) and making computational cost independent of kernel size. The authors incorporate the module into an FCN and claim competitive performance on human pose estimation benchmarks.

Significance. If the empirical results hold, the technique could provide an efficient, exact alternative to standard large-kernel or dilated convolutions for enlarging receptive fields in dense prediction without quadratic parameter growth or resolution loss. A strength is that the acceleration rests on standard properties of summed-area tables rather than approximations or additional learned components.

major comments (2)

[Abstract] The claim that the method 'demonstrate[s] its competitive performance on popular benchmarks' supplies no quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these, it is impossible to assess whether the central engineering claim, that learnable box filters suffice for accurate dense prediction, holds.
[Proposed method] Each filter is restricted to a single axis-aligned rectangle with uniform weights whose only free parameters are the four corner coordinates. This cannot represent non-uniform weights, multiple disjoint supports, or oriented patterns. The manuscript provides no analysis showing that this limited expressivity is adequate for keypoint localization in pose estimation, which is load-bearing for the claim that the module can replace standard large-kernel convolutions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and completeness where appropriate.

Referee: [Abstract] The claim that the method 'demonstrate[s] its competitive performance on popular benchmarks' supplies no quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these, it is impossible to assess whether the central engineering claim, that learnable box filters suffice for accurate dense prediction, holds.

Authors: We agree that the abstract is overly concise and omits key quantitative details. The full manuscript (Section 4) reports PCKh@0.5 scores on MPII, AP on COCO, comparisons against dilated-convolution and standard large-kernel baselines, and ablation studies on kernel size and number of box filters. In the revision we will expand the abstract to include representative metrics (e.g., “achieving 88.4 PCKh on MPII and 72.3 AP on COCO, competitive with ResNet-101 while using 4× fewer parameters for the largest receptive-field layers”) together with a brief mention of the ablation results.

revision: yes

Referee: [Proposed method] Each filter is restricted to a single axis-aligned rectangle with uniform weights whose only free parameters are the four corner coordinates. This cannot represent non-uniform weights, multiple disjoint supports, or oriented patterns. The manuscript provides no analysis showing that this limited expressivity is adequate for keypoint localization in pose estimation, which is load-bearing for the claim that the module can replace standard large-kernel convolutions.

Authors: The module is presented as an efficient mechanism for realizing arbitrarily large receptive fields rather than a universal replacement for all convolutions. A single box filter indeed has restricted expressivity; however, the network stacks multiple independent box filters across channels and layers, and the learned corner coordinates allow each filter to adapt its support. The empirical evidence in Section 4 shows that replacing selected large-kernel layers with these box filters yields competitive pose-estimation accuracy on MPII and COCO. We acknowledge that the manuscript contains no formal expressivity analysis or comparison against oriented or multi-rectangle filters; we will add a limitations paragraph discussing this restriction and outlining possible extensions (e.g., multiple boxes per filter) in the revised version.

revision: partial
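
The point about combining several boxes can be illustrated with a toy example. This sketch is an editorial illustration only: the helper `box_kernel` and the mixing weights are hypothetical, and the assumption that a later layer mixes box responses linearly is not confirmed by the text above. It shows that a weighted sum of two uniform boxes already yields a non-uniform, center-surround kernel.

```python
import numpy as np

def box_kernel(size, top, left, bottom, right):
    """Dense size x size kernel equal to 1 inside the rectangle, 0 outside."""
    k = np.zeros((size, size))
    k[top:bottom, left:right] = 1.0
    return k

outer = box_kernel(9, 0, 0, 9, 9)  # 9x9 surround
inner = box_kernel(9, 3, 3, 6, 6)  # 3x3 center

# Hypothetical mixing weights chosen so the combined kernel sums to zero.
combined = inner - (inner.sum() / outer.sum()) * outer

assert abs(combined.sum()) < 1e-12
assert np.unique(combined).size == 2  # two distinct weight levels
```

Each box on its own is uniform, but the combination takes two weight levels. Oriented patterns remain out of reach for axis-aligned rectangles, in line with the referee's objection.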

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard integral-image properties

The paper adapts summed-area tables (a well-known O(1) technique for rectangular sums) to enable large-kernel box filtering inside an FCN and treats the four corner coordinates as learnable parameters. No equation reduces a claimed prediction or efficiency gain to a fitted quantity by construction, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled via prior author work. The central efficiency claim follows directly from the algebraic identity that the sum over any axis-aligned rectangle equals four lookups in the precomputed integral image, independent of the present paper's learned bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on the mathematical property that any rectangular sum can be obtained from four lookups in a summed-area table, which is a standard result from computer graphics and does not require new axioms or free parameters beyond the learnable box coordinates themselves.

axioms (1)

standard math: Rectangular sums can be recovered from four corner lookups in a precomputed integral image (standard property of summed-area tables). Invoked when stating that computational cost becomes independent of filter size.
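
For reference, the identity can be written out as below. The symbols are chosen here for illustration and are not taken from the paper: I is the input, S is the table, and the rectangle covers rows t to b − 1 and columns l to r − 1.

```latex
S(y, x) = \sum_{i < y} \sum_{j < x} I(i, j),
\qquad
\sum_{i = t}^{b - 1} \sum_{j = l}^{r - 1} I(i, j)
  = S(b, r) - S(t, r) - S(b, l) + S(t, l).
```

The right-hand side involves four table entries whatever the values of b − t and r − l, which is the sense in which the cost is independent of filter size.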
