Large Area 3D Human Pose Detection Via Stereo Reconstruction in Panoramic Cameras

Andreas Pichler; Christoph Heindl; Josef Scharinger; Thomas Pönitz

arxiv: 1907.00534 · v1 · pith:RJC2TKHYnew · submitted 2019-07-01 · 💻 cs.CV

Pith reviewed 2026-05-25 12:22 UTC · model grok-4.3

keywords 3D human pose estimation · panoramic cameras · fisheye distortion correction · stereo reconstruction · rectilinear projection · deep learning pose estimation · ergonomic analysis

The pith

Converting fisheye panoramic images to rectilinear views lets standard 2D pose estimators produce accurate 3D poses via stereo without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that panoramic camera images can be mapped to conventional flat perspectives so that existing deep-learning networks for 2D human pose estimation can be applied unchanged. These 2D detections from two cameras are then combined through stereo triangulation to recover 3D joint positions across a wide field of view. The approach targets applications such as ergonomic analysis where large-area coverage matters and where collecting new training data for distorted images would be expensive.

Core claim

Transforming fisheye perspectives to rectilinear views allows a direct application of two-dimensional deep-learning pose estimation methods without the explicit need for a costly re-training step to compensate for fisheye image distortions, enabling accurate 3D human pose estimation over large fields of view using two panoramic cameras.

What carries the argument

The undistortion step that maps panoramic fisheye images to rectilinear projections, allowing unmodified 2D pose networks to operate and their outputs to be triangulated in 3D.
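
To make the undistortion step concrete, the following minimal Python sketch maps a pixel of a virtual rectilinear view to the fisheye pixel it would be sampled from. It assumes an ideal equidistant fisheye model (image radius r = f_fish * theta) and a virtual pinhole camera that shares the fisheye's optical axis. The paper's actual lens model and calibration are not stated here, and the function name and all parameters are hypothetical.

```python
import math


def rectilinear_to_fisheye(u, v, f_rect, c_rect, f_fish, c_fish):
    """Map a pixel (u, v) of a virtual pinhole view to fisheye coordinates.

    Assumption: ideal equidistant fisheye (r = f_fish * theta) with the same
    optical axis as the virtual pinhole camera. c_rect and c_fish are the
    principal points (x, y) of the two images.
    """
    # Back-project the rectilinear pixel to a viewing ray (x, y, 1).
    x = (u - c_rect[0]) / f_rect
    y = (v - c_rect[1]) / f_rect
    # Angle between the ray and the optical axis, and the ray's azimuth.
    theta = math.atan2(math.hypot(x, y), 1.0)
    phi = math.atan2(y, x)
    # Equidistant projection: image radius grows linearly with theta.
    r = f_fish * theta
    return (c_fish[0] + r * math.cos(phi), c_fish[1] + r * math.sin(phi))
```

Evaluating this mapping for every output pixel gives a lookup table for resampling the fisheye image. An upright view such as the one in Figure 4 would presumably also rotate the ray before projecting it, which this sketch leaves out.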

If this is right

3D poses can be recovered over wide areas using only off-the-shelf 2D networks and standard stereo geometry.
Ergonomic and pose-based assessments become feasible without custom fisheye training datasets.
The method works with any two panoramic cameras whose relative pose is known.
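
As an illustration of the stereo step, the sketch below triangulates one joint from two viewing rays with the midpoint method. It is a stand-in chosen for brevity and is not necessarily the triangulation the authors use. The ray origins and directions are assumed to come from the known camera poses and the detected 2D joints, and all names are hypothetical.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Return the 3D point halfway between the closest points of two rays.

    Each ray is given as origin o and direction d (3-element sequences) in a
    common world frame; directions need not be normalized.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    # Ray parameters of the mutually closest points on each ray.
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = [o + t1 * k for o, k in zip(o1, d1)]
    p2 = [o + t2 * k for o, k in zip(o2, d2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]
```

Working with rays rather than pinhole projection matrices keeps the sketch independent of the lens model, since any calibrated camera can turn a pixel into a viewing ray.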

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same undistortion-plus-standard-network pattern could apply to other wide-angle or catadioptric camera systems in robotics or surveillance.
If the conversion step is made differentiable it might allow end-to-end fine-tuning while still starting from pretrained rectilinear weights.
Performance on fast motions or crowded scenes would still depend on the underlying 2D estimator's robustness.

Load-bearing premise

Mapping panoramic images to rectilinear views must keep geometric fidelity high enough that unmodified 2D pose estimators still output joint locations accurate enough for reliable stereo triangulation.

What would settle it

Running the pipeline on a calibrated test set of panoramic images with known ground-truth 3D poses and finding that the resulting 3D joint errors exceed those of a model retrained directly on fisheye data would falsify the claim.
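
One common way to quantify the "3D joint errors" in such a test is the mean per-joint position error. The sketch below is an assumption about how the comparison could be scored, not a metric taken from the paper; the function name is hypothetical and both inputs are assumed to be in the same units and joint order.

```python
import math


def mean_per_joint_error(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty joint lists of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)
```

Computing this score for both the rectified pipeline and a fisheye-retrained baseline on the same frames would give the direct comparison described above.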

Figures

Figures reproduced from arXiv: 1907.00534 by Andreas Pichler, Christoph Heindl, Josef Scharinger, Thomas Pönitz.

**Figure 4.** Upright rectilinear view generation. Fisheye input …

**Figure 5.** Various stages of the 2D pose estimation algorithm.

**Figure 6.** Examples of 3D stereo reconstruction of human body joints in fisheye images. Left/middle: input images and …

**Figure 9.** Reconstruction frequencies of individual limbs over …

Original abstract

We propose a novel 3D human pose detector using two panoramic cameras. We show that transforming fisheye perspectives to rectilinear views allows a direct application of two-dimensional deep-learning pose estimation methods, without the explicit need for a costly re-training step to compensate for fisheye image distortions. By utilizing panoramic cameras, our method is capable of accurately estimating human poses over a large field of view. This renders our method suitable for ergonomic analyses and other pose based assessments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Rectification lets you reuse 2D pose nets on panoramic images for 3D stereo, but the abstract supplies zero numbers so the claim stays untested.

The main takeaway is that this paper describes a practical workaround: take images from two panoramic cameras, rectify the fisheye views into rectilinear ones, feed them to an off-the-shelf 2D pose estimator, and triangulate the results for 3D. The point is to cover a large field of view for ergonomics without retraining the networks on distorted data. That reuse of existing models is the engineering move they highlight, and it could save effort if the rectification step does not degrade the 2D outputs too much. The setup targets applied settings where wide-area pose tracking matters more than pushing the state of the art in distortion handling.

What they do reasonably well is identify the convenience for practitioners who already trust standard 2D networks and want to extend coverage with panoramic hardware. The approach stays simple and avoids the cost of domain-specific retraining.

The soft spot is the total absence of quantitative evidence: no accuracy numbers, no baseline comparisons, no error breakdown of how much the resampling affects keypoint localization or the final 3D triangulation. The stress-test worry about interpolation artifacts and uneven resolution loss is plausible on its face, and nothing in the abstract rules it out or bounds the impact. Without data, the central assertion that unmodified 2D methods work directly remains a hypothesis rather than a demonstrated result.

This kind of paper would mainly interest readers working on industrial vision or human-factors applications who need quick wide-FOV solutions. Sending it to a serious referee would be worthwhile only if the full manuscript contains solid experiments with real error metrics and comparisons; otherwise the work is too thin on evidence to justify the time. I would not send it for review in its current form.

Referee Report

2 major / 1 minor

Summary. The paper proposes a 3D human pose detector that uses a pair of panoramic cameras. Its central claim is that converting fisheye images to rectilinear views permits unmodified, off-the-shelf 2D deep-learning pose estimators to be applied directly, after which stereo triangulation recovers 3D poses over a large field of view, making the method suitable for ergonomic analysis without the need for costly retraining on distorted imagery.

Significance. If the rectification step truly preserves the appearance and geometry statistics required by existing 2D networks, the method would offer a practical route to large-FOV 3D pose estimation that re-uses mature 2D models and avoids domain-specific retraining. The absence of any quantitative validation, however, leaves both the magnitude of any rectification-induced error and the overall accuracy of the pipeline untested.

major comments (2)

[Abstract] Abstract: the assertion that fisheye-to-rectilinear conversion 'allows a direct application of two-dimensional deep-learning pose estimation methods, without the explicit need for a costly re-training step' is presented without any supporting experiments, error metrics, or baseline comparisons; no quantitative evidence is supplied that joint-localization accuracy remains adequate for subsequent stereo triangulation.
[Method] Method (transformation step): the paper supplies no analysis or bound on the spatially varying interpolation artifacts and resolution loss that rectification necessarily introduces; if these degrade keypoint localization on limbs, the downstream stereo reconstruction error increases directly, undermining the 'no re-training needed' claim.
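
To give a sense of the scale of the resolution loss raised in the second comment, the sketch below computes the local stretch that rectification applies to an ideal equidistant fisheye image. The equidistant model (r = f * theta), the equal focal lengths of the two views, and the function name are assumptions made for illustration; the paper's lens model may differ.

```python
import math


def rectification_stretch(theta):
    """Local (radial, tangential) stretch at angle theta off the optical axis.

    Assumes fisheye radius r_f = f * theta and rectilinear radius
    r_r = f * tan(theta). Radial stretch is d r_r / d r_f = 1 / cos(theta)^2;
    tangential stretch is r_r / r_f = tan(theta) / theta.
    """
    if not 0.0 <= theta < math.pi / 2:
        raise ValueError("theta must lie in [0, pi/2)")
    radial = 1.0 / math.cos(theta) ** 2
    tangential = math.tan(theta) / theta if theta > 0.0 else 1.0
    return radial, tangential
```

Under these assumptions one fisheye pixel at 60 degrees off axis is spread over roughly four rectilinear pixels radially and about 1.65 tangentially, so joints near the edge of the rectified view are interpolated from much coarser source data than joints near the center.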

minor comments (1)

The title refers to 'Panoramic Cameras' while the abstract specifies 'fisheye perspectives'; a brief clarification of the optical model and the exact rectification pipeline would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major concerns regarding the lack of quantitative validation in our response below.

Referee: [Abstract] Abstract: the assertion that fisheye-to-rectilinear conversion 'allows a direct application of two-dimensional deep-learning pose estimation methods, without the explicit need for a costly re-training step' is presented without any supporting experiments, error metrics, or baseline comparisons; no quantitative evidence is supplied that joint-localization accuracy remains adequate for subsequent stereo triangulation.

Authors: The referee is correct that the manuscript does not include quantitative experiments validating the claim in the abstract. To address this, we will revise the manuscript to include a quantitative evaluation of 2D keypoint detection accuracy on rectified versus original images using a public dataset, providing error metrics and comparisons to support the 'no re-training' assertion. Revision: yes.

Referee: [Method] Method (transformation step): the paper supplies no analysis or bound on the spatially varying interpolation artifacts and resolution loss that rectification necessarily introduces; if these degrade keypoint localization on limbs, the downstream stereo reconstruction error increases directly, undermining the 'no re-training needed' claim.

Authors: We agree that a detailed analysis of the rectification artifacts is missing. In the revised manuscript, we will add a subsection analyzing the resolution loss and interpolation effects across the image, providing bounds on the potential impact to keypoint localization to justify the use of unmodified 2D estimators. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; method claim is self-contained

The paper's central claim—that fisheye-to-rectilinear rectification permits unmodified 2D pose networks to be applied directly—is a methodological assertion resting on the geometric properties of standard image undistortion, not on any derivation, fitted parameter, or self-citation chain. No equations appear in the provided text, no parameters are tuned to a subset of results and then re-predicted, and no uniqueness theorems or ansatzes are imported from prior author work. The argument therefore contains no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that rectification followed by off-the-shelf 2D pose estimation plus stereo triangulation will produce usable 3D output; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: Rectilinear transformation of fisheye images is sufficiently accurate that unmodified 2D pose networks remain effective. Invoked by the statement that no re-training is needed.
domain assumption: Stereo reconstruction from two panoramic views can recover 3D human poses once 2D joints are located. Implicit in the overall pipeline description.

pith-pipeline@v0.9.0 · 5609 in / 1242 out tokens · 29846 ms · 2026-05-25T12:22:35.222427+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 7 internal anchors

[1] J. K. Aggarwal and M. S. Ryoo, “Human activity analysis: A review,” ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 16, 2011.
[2] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion capture and analysis,” Computer vision and image understanding, vol. 104, no. 2-3, pp. 90–126, 2006.
[3] M. Motion, “Gypsy motion capture system,” 2004.
[4] D. Roetenberg, H. Luinge, and P. Slycke, “Xsens mvn: full 6dof human motion tracking using miniature inertial sensors,” Xsens Motion Technologies BV, Tech. Rep, vol. 1, 2009.
[5] I. Vicon Motion Systems, “Vicon Motion Systems.” http://www.vicon.com
[6] I. Qualisys, “Qualisys. The Swedish motion capture company.” http://www.qualisys.com
[7] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” International journal of computer vision, vol. 61, no. 1, pp. 55–79, 2005.
[8] M. Andriluka, S. Roth, and B. Schiele, “Pictorial structures revisited: People detection and articulated pose estimation,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1014–1021, IEEE, 2009.
[9] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Poselet conditioned pictorial structures,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 588–595, IEEE, 2013.
[10] Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 12, pp. 2878–2890, 2013.
[11] H.-D. Yang and S.-W. Lee, “Reconstruction of 3d human body pose from stereo image sequences based on top-down learning,” Pattern Recognition, vol. 40, no. 11, pp. 3120–3131, 2007.
[12] Y. Matsumoto and A. Zelinsky, “An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement,” in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, pp. 499–504, IEEE, 2000.
[13] S. Knoop, S. Vacek, and R. Dillmann, “Sensor fusion for 3d human body tracking with an articulated 3d body model,” in Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on, pp. 1686–1691, IEEE, 2006.
[14] Y. Zhu and K. Fujimura, “Constrained optimization for human pose estimation from depth sequences,” in Asian Conference on Computer Vision, pp. 408–418, Springer, 2007.
[15] C. Plagemann, V. Ganapathi, D. Koller, and S. Thrun, “Real-time identification and localization of body parts from depth images,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, pp. 3108–3113, IEEE, 2010.
[16] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1297–1304, IEEE, 2011.
[17] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1913–1921, 2015.
[18] U. Iqbal, M. Garbade, and J. Gall, “Pose for action-action for pose,” arXiv preprint arXiv:1603.04037, 2016.
[19] U. Iqbal, A. Milan, and J. Gall, “Pose-track: Joint multi-person pose estimation and tracking,” CoRR, vol. abs/1611.07727, 2016.
[20] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, pp. 3686–3693, 2014.
[21] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014.
[22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[23] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” CoRR, vol. abs/1312.4659, 2013.
[24] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, vol. 1, p. 7, 2017.
[25] K. Otsuka, S. Araki, K. Ishizuka, M. Fujimoto, M. Heinrich, and J. Yamato, “A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization,” in Proceedings of the 10th international conference on Multimodal interfaces, pp. 257–264, ACM, 2008.
[26] E. Cervera, N. Garcia-Aracil, E. Martinez, L. Nomdedeu, and A. P. Del Pobil, “Safety for a robot arm moving amidst humans by using panoramic vision,” in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pp. 2183–2188, IEEE, 2008.
[27] R. Stiefelhagen, J. Yang, and A. Waibel, “Simultaneous tracking of head poses in a panoramic view,” in Pattern Recognition, 2000. Proceedings. 15th International Conference on, vol. 3, pp. 722–725, IEEE, 2000.
[28] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 11, pp. 1330–1334, 2000.
[29] D. C. Brown, “Decentering distortion of lenses,” Photogrammetric Engineering and Remote Sensing, 1966.
[30] A. Basu and S. Licardie, “Alternative models for fish-eye lenses,” Pattern recognition letters, vol. 16, no. 4, pp. 433–441, 1995.
[31] D. Schneider, E. Schwalbe, and H.-G. Maas, “Validation of geometric models for fisheye lenses,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 64, no. 3, pp. 259–266, 2009.
[32] C. Hughes, P. Denny, E. Jones, and M. Glavin, “Accuracy of fish-eye lens models,” Applied optics, vol. 49, no. 17, pp. 3338–3347, 2010.
[33] J. Kannala and S. S. Brandt, “A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 8, pp. 1335–1340, 2006.
[34] S. Shah and J. Aggarwal, “Intrinsic parameter calibration procedure for a (high-distortion) fish-eye lens camera with distortion model and accuracy estimation,” Pattern Recognition, vol. 29, no. 11, pp. 1775–1788, 1996.
[35] D. B. Gennery, “Generalized camera calibration including fish-eye lenses,” International Journal of Computer Vision, vol. 68, no. 3, pp. 239–266, 2006.
[36] T. J. Herbert, “Calibration of fisheye lenses by inversion of area projections,” Applied optics, vol. 25, no. 12, pp. 1875–1876, 1986.
[37] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” arXiv preprint arXiv:1703.07402, 2017.
[38] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, “Real-time foreground–background segmentation using codebook model,” Real-time imaging, vol. 11, no. 3, pp. 172–185, 2005.
[39] S. Abraham and W. Förstner, “Fish-eye-stereo calibration and epipolar rectification,” ISPRS Journal of photogrammetry and remote sensing, vol. 59, no. 5, pp. 278–288, 2005.
[40] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
[41] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry, An invitation to 3-d vision: from images to geometric models, vol. 26. Springer Science & Business Media, 2012.
[42] R. Hartley, R. Gupta, and T. Chang, “Stereo from uncalibrated cameras,” in Computer Vision and Pattern Recognition, 1992. Proceedings CVPR’92., 1992 IEEE Computer Society Conference on, pp. 761–764, IEEE, 1992.
[43] R. I. Hartley and P. Sturm, “Triangulation,” Computer vision and image understanding, vol. 68, no. 2, pp. 146–157, 1997.
[44] K. Levenberg, “A method for the solution of certain non-linear problems in least squares,” Quarterly of applied mathematics, vol. 2, no. 2, pp. 164–168, 1944.
[45] G. Ning and Z. He, “Dual path networks for multi-person human pose estimation,” arXiv preprint arXiv:1710.10192, 2017.
