
arxiv: 1907.08816 · v1 · pith:EVSGKYUKnew · submitted 2019-07-20 · 💻 cs.CV

Pan-tilt-zoom SLAM for Sports Videos

Pith reviewed 2026-05-24 18:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords pan-tilt-zoom camera · SLAM · sports video · camera pose estimation · ray landmarks · moving object detection · online mapping

The pith

An online SLAM system uses rays as landmarks to track PTZ cameras in dynamic sports videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an online SLAM method built for pan-tilt-zoom cameras filming fast-paced sports such as basketball and soccer. It replaces point landmarks with rays to address the lack of depth data that comes from pure camera rotation and adds player detection to reduce interference from large moving foreground regions. The approach also includes a novel camera model for tracking and an online pan-tilt forest for building the map. Experiments on synthetic and real datasets are presented to show improved camera pose estimates compared with earlier techniques. A sympathetic reader would care because reliable real-time camera tracking in these settings supports automated analysis and production of live sports footage.

Core claim

The authors claim that treating rays as landmarks inside a pure-rotation camera model, together with an online pan-tilt forest and explicit moving-object detection, produces more accurate online pose estimates for PTZ cameras in sports videos than previous methods.

What carries the argument

Rays as landmarks inside a pure-rotation camera model that supplies direction without depth for mapping.
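
To make the idea concrete, here is a minimal sketch of how a pixel can be turned into a ray landmark and projected back under a pure-rotation camera. The conventions are assumptions chosen for illustration and are not taken from the paper: the camera sits at the origin, pan rotates about the vertical axis, tilt about the horizontal axis, there is no roll, pixels are square, and the principal point `(cx, cy)` is the image centre. The function names `pixel_to_ray` and `ray_to_pixel` are hypothetical.

```python
import math


def pixel_to_ray(u, v, pan, tilt, f, cx=640.0, cy=360.0):
    """Back-project a pixel to a ray (theta, phi) in the fixed world frame.

    Assumed convention: a ray with angles (theta, phi) is the direction a
    camera with pan=theta and tilt=phi sees through its principal point.
    """
    # Direction in the camera frame (any positive scale is fine).
    x, y, z = (u - cx) / f, (v - cy) / f, 1.0
    # Undo the tilt rotation (about the camera x axis).
    y, z = (y * math.cos(tilt) + z * math.sin(tilt),
            -y * math.sin(tilt) + z * math.cos(tilt))
    # Undo the pan rotation (about the vertical axis).
    x, z = (x * math.cos(pan) + z * math.sin(pan),
            -x * math.sin(pan) + z * math.cos(pan))
    return math.atan2(x, z), math.atan2(y, math.hypot(x, z))


def ray_to_pixel(theta, phi, pan, tilt, f, cx=640.0, cy=360.0):
    """Project a ray landmark into the image; returns None if it is behind the camera."""
    # Unit direction of the ray in the world frame.
    x = math.sin(theta) * math.cos(phi)
    y = math.sin(phi)
    z = math.cos(theta) * math.cos(phi)
    # Apply pan, then tilt, to move into the camera frame.
    x, z = (x * math.cos(pan) - z * math.sin(pan),
            x * math.sin(pan) + z * math.cos(pan))
    y, z = (y * math.cos(tilt) - z * math.sin(tilt),
            y * math.sin(tilt) + z * math.cos(tilt))
    if z <= 0.0:
        return None
    return f * x / z + cx, f * y / z + cy
```

Under these assumptions the map stores only the two angles per landmark. No depth is ever estimated, which is the property the ray representation is meant to exploit.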

If this is right

  • The system runs online and therefore supports real-time applications during live sports broadcasts.
  • Player detection reduces the disruptive effect of large foreground regions on pose estimation.
  • Ray landmarks enable mapping even when the camera undergoes only rotation.
  • An online pan-tilt forest maintains the map structure as the camera moves.
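
The player-detection point in the list above can be sketched as a simple keypoint filter. This is an illustrative assumption about how detections might be used, not the authors' implementation: the detector itself is out of scope, boxes are taken to be axis-aligned `(x0, y0, x1, y1)` tuples in pixels, and `filter_background_keypoints` and `margin` are hypothetical names.

```python
def filter_background_keypoints(keypoints, player_boxes, margin=0.0):
    """Keep only keypoints that fall outside every detected player box.

    keypoints    -- list of (x, y) pixel coordinates
    player_boxes -- list of (x0, y0, x1, y1) boxes from some player detector
    margin       -- optional padding in pixels added around each box
    """
    def inside(point, box):
        x, y = point
        x0, y0, x1, y1 = box
        return (x0 - margin <= x <= x1 + margin
                and y0 - margin <= y <= y1 + margin)

    return [pt for pt in keypoints
            if not any(inside(pt, box) for box in player_boxes)]


# Example: the middle keypoint lies on a player and is discarded.
kept = filter_background_keypoints(
    [(10, 10), (50, 60), (200, 120)],
    [(40, 40, 80, 100)],
)
```

Only the surviving background keypoints would then be handed to tracking and mapping, so moving players do not contribute observations to the pose estimate.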

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ray-based representation could be tested on other pure-rotation camera scenarios outside sports.
  • Coupling the pose estimates with separate player tracking pipelines might improve overall scene reconstruction.
  • The method opens a route for handling depth-less mapping in additional SLAM variants that encounter rapid rotations.

Load-bearing premise

That rays as landmarks overcome the missing depth information in pure-rotation cameras and that moving-object detection sufficiently mitigates foreground interference.
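
The first half of this premise follows from standard projective geometry. As a short worked statement, using notation assumed here rather than taken from the paper (intrinsics $K$, rotation $R$, camera centre at the origin, a scene point at unknown depth $\lambda$ along unit direction $\mathbf{d}$):

```latex
\begin{aligned}
\mathbf{X} &= \lambda\,\mathbf{d}, \qquad \lambda > 0,\ \lVert\mathbf{d}\rVert = 1, \\
\mathbf{x} &\sim K R\,\mathbf{X} \;=\; \lambda\,K R\,\mathbf{d} \;\sim\; K R\,\mathbf{d}.
\end{aligned}
```

Because homogeneous image coordinates are defined only up to scale, $\lambda$ cancels: every point along the ray projects to the same pixel for every rotation $R$. Depth is therefore unobservable from a purely rotating camera, while the direction $\mathbf{d}$ is fully observable, which is why a ray is a sufficient landmark in this setting.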

What would settle it

A test sequence of PTZ sports video with known ground-truth camera poses in which the estimated poses deviate beyond the error levels reported for competing methods.
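
One way such a test could be scored is sketched below. It assumes per-frame estimates and ground truth are available as equal-length lists for a single quantity (for example pan angle in degrees) and that no angle wrap-around occurs over the sequence. The helper name `summarize_errors` is hypothetical. The mean, median and max summary mirrors the statistics reported alongside Figure 4.

```python
import statistics


def summarize_errors(estimated, ground_truth):
    """Return (mean, median, max) of the absolute per-frame errors."""
    if len(estimated) != len(ground_truth):
        raise ValueError("sequences must have the same number of frames")
    if not estimated:
        raise ValueError("sequences must not be empty")
    errors = [abs(e - g) for e, g in zip(estimated, ground_truth)]
    return statistics.mean(errors), statistics.median(errors), max(errors)


# Example with three frames of pan angle in degrees.
mean_err, median_err, max_err = summarize_errors(
    [10.0, 12.5, 15.0],
    [10.5, 12.0, 17.0],
)
```

Running the same summary for each competing method on the same sequence gives the side-by-side numbers this test calls for.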

Figures

Figures reproduced from arXiv: 1907.08816 by James J. Little, Jianhui Chen, Jikai Lu.

Figure 1: System overview. Given a PTZ base and the first camera pose, our system outputs …
Figure 2: The coordinate system and ray landmarks. The camera pose is represented by pan, …
Figure 3: Player detection. Left: keypoints without player detection; right: keypoints with …
Figure 4: Synthetic image examples.

Reprojection error (pix.) per synthetic sequence, as extracted alongside Figure 4:

Seq. ID  Velocity  EKF-H                  EKF-PTZ (ours)
                   Mean   Median  Max     Mean   Median  Max
1        0.02      0.1    0.1     0.1     0.1    0.1     0.2
2        0.83      0.4    0.4     0.7     0.3    0.1     1.1
3        0.70      1.0    0.3     16.0    0.3    0.3     0.5
4        0.08      2.1    2.2     3.9     0.7    0.7     1.3

Figure 5: Estimated camera trajectories of our method. (a) basketball; (b) soccer.
Figure 6: Qualitative comparison with EKF-H. The left figure shows pan angle errors of …
Original abstract

We present an online SLAM system specifically designed to track pan-tilt-zoom (PTZ) cameras in highly dynamic sports such as basketball and soccer games. In these games, PTZ cameras rotate very fast and players cover large image areas. To overcome these challenges, we propose to use a novel camera model for tracking and to use rays as landmarks in mapping. Rays overcome the missing depth in pure-rotation cameras. We also develop an online pan-tilt forest for mapping and introduce moving objects (players) detection to mitigate negative impacts from foreground objects. We test our method on both synthetic and real datasets. The experimental results show the superior performance of our method over previous methods for online PTZ camera pose estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents an online SLAM system for tracking fast-moving PTZ cameras in dynamic sports videos (e.g., basketball, soccer). It proposes a novel camera model, ray-based landmarks to address missing depth in pure-rotation scenarios, an online pan-tilt forest for mapping, and moving-object detection to reduce foreground interference. Experiments on synthetic and real datasets are claimed to demonstrate superior performance over prior methods for online PTZ pose estimation.

Significance. If the performance claims hold with proper validation, the work addresses a practical gap in sports video analysis and broadcasting, where PTZ cameras operate under extreme rotation speeds and heavy foreground occlusion. The ray-landmark idea is a direct response to the pure-rotation depth problem and could be reusable; the online pan-tilt forest is a domain-specific mapping contribution.

major comments (2)
  1. [Experiments] Experiments section: the central claim of 'superior performance' over previous PTZ methods is not supported by any reported quantitative metrics, error tables, baseline comparisons, or statistical tests in the manuscript description; without these, the assertion that ray landmarks and moving-object detection drive the gains cannot be evaluated.
  2. [Method] Method (ray landmarks and moving-object detection): no ablation studies or component-wise error breakdowns are described that isolate whether rays overcome depth loss or whether foreground detection mitigates interference; full-system comparisons alone leave open the possibility that other factors (camera model or dataset) explain results.
minor comments (1)
  1. [Abstract] Abstract and introduction repeat the performance claim without previewing any numerical results or dataset sizes, which weakens the reader's ability to assess scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to strengthen the experimental presentation.

  1. Referee: [Experiments] Experiments section: the central claim of 'superior performance' over previous PTZ methods is not supported by any reported quantitative metrics, error tables, baseline comparisons, or statistical tests in the manuscript description; without these, the assertion that ray landmarks and moving-object detection drive the gains cannot be evaluated.

    Authors: We agree that the current presentation of results does not sufficiently document the quantitative evidence. The revised manuscript will expand the experiments section to include explicit error tables, direct numerical comparisons against prior PTZ methods, and statistical tests on both the synthetic and real datasets.
    Revision: yes

  2. Referee: [Method] Method (ray landmarks and moving-object detection): no ablation studies or component-wise error breakdowns are described that isolate whether rays overcome depth loss or whether foreground detection mitigates interference; full-system comparisons alone leave open the possibility that other factors (camera model or dataset) explain results.

    Authors: We acknowledge that component-wise analysis would clarify the individual contributions. The revision will add ablation studies that report error breakdowns when ray landmarks and moving-object detection are enabled or disabled independently.
    Revision: yes
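
The ablation promised here amounts to running every on/off combination of the two components. A minimal sketch of that grid follows. The component names and the helper `ablation_configs` are hypothetical, and a real run would also need a fallback landmark type for the configurations in which ray landmarks are switched off.

```python
from itertools import product


def ablation_configs(components=("ray_landmarks", "player_detection")):
    """Enumerate every enabled/disabled combination of the given components."""
    return [dict(zip(components, flags))
            for flags in product((False, True), repeat=len(components))]


# Four configurations for two components, from all-off to all-on.
configs = ablation_configs()
```

Reporting the same error statistics for each configuration would separate the contribution of each component from that of the full system.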

Circularity Check

0 steps flagged

No circularity; claims rest on external experimental comparison without self-referential reduction


The provided abstract and context describe a PTZ SLAM system using a novel camera model, ray landmarks to address pure rotation, pan-tilt forest, and moving-object detection. No equations, parameter fits, or derivations are shown that reduce a claimed result to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The central claim is empirical superiority on synthetic and real datasets, which is externally falsifiable and does not collapse into a renaming, ansatz smuggling, or fitted-input prediction. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; the central technical premise is the effectiveness of ray landmarks for depth-less pure-rotation cases, treated as a domain assumption.

axioms (1)
  • domain assumption: Rays overcome the missing depth in pure-rotation cameras
    Explicitly proposed in the abstract as the solution to depth absence.

pith-pipeline@v0.9.0 · 5643 in / 1028 out tokens · 22209 ms · 2026-05-24T18:53:09.996400+00:00 · methodology

