arXiv: 1907.03698 · v1 · pith:TDDBA7EVnew · submitted 2019-07-08 · 💻 cs.LG · cs.CV · cs.MM · stat.ML

TrackNet: A Deep Learning Network for Tracking High-speed and Tiny Objects in Sports Applications

Pith reviewed 2026-05-25 01:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.MM · stat.ML
keywords tennis ball tracking · deep learning · heatmap detection · sports video analysis · high-speed object tracking · broadcast video · object localization

The pith

TrackNet tracks high-speed tiny tennis balls in broadcast videos by generating heatmaps from single or consecutive frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TrackNet to locate a small fast-moving tennis ball in sports broadcast videos where the ball is often blurry, streaked with afterimages, or invisible. The network is trained to detect the ball from its appearance in one frame and its motion patterns across several frames, then outputs a probability heatmap to mark the position. A reader would care because ball trajectory information underpins evaluation of player performance and game strategy analysis, yet existing methods fail on these difficult cases. The system processes standard 640 by 360 broadcast frames and is tested directly on a public YouTube video of a major tournament final.
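
As a rough illustration of how single-frame and multi-frame inputs might be assembled, the sketch below groups a frame sequence into sliding windows. The function name `frame_windows`, the default of three frames, and the choice to pair each window with its last frame are assumptions made for illustration; this page does not say which frame a multi-frame heatmap refers to.

```python
def frame_windows(frames, num_input_frames=3):
    """Yield (window, target_index) pairs over a list of frames.

    Each window holds `num_input_frames` consecutive frames. The target
    index points at the last frame of the window (an assumption, not a
    detail confirmed by the source).
    """
    if num_input_frames < 1:
        raise ValueError("num_input_frames must be at least 1")
    for end in range(num_input_frames - 1, len(frames)):
        start = end - num_input_frames + 1
        yield frames[start:end + 1], end


# Frames are stand-in strings here; real frames would be 640 by 360 images.
for window, target in frame_windows(["f0", "f1", "f2", "f3"], num_input_frames=3):
    print(target, window)
```

With `num_input_frames=1` every frame forms its own window, which corresponds to the single-frame setting.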

Core claim

TrackNet is a deep learning network that accepts 640 by 360 images and produces a detection heatmap from either one frame or multiple consecutive frames to locate the tennis ball. On the 2017 Summer Universiade men's singles final video it attains 99.7 percent precision, 97.3 percent recall, and 98.5 percent F1-measure. Under 10-fold cross-validation on nine additional partially labeled videos plus a subset of the original dataset, the figures drop to 95.3 percent precision, 75.7 percent recall, and 84.3 percent F1-measure. TrackNet also substantially outperforms a conventional image-processing baseline implemented for comparison.
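
The F1-measure is the harmonic mean of precision and recall, so the reported figures can be sanity-checked from the rounded percentages alone. The helper name `f1_score` is ours. Because the inputs are already rounded, the recomputed cross-validation value (about 84.4 percent) can differ from the reported 84.3 percent in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted above: single-video test, then 10-fold cross-validation.
single_video = f1_score(0.997, 0.973)
cross_validation = f1_score(0.953, 0.757)
print(f"{single_video:.3f} {cross_validation:.3f}")
```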

What carries the argument

A heatmap-based network, described in the Figure 8 caption as a convolutional neural network followed by a deconvolutional network, trained to output ball-position probability maps by combining single-frame appearance cues with multi-frame motion patterns.
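
To make the heatmap idea concrete, here is a minimal sketch of one way a ball position could be encoded as a heatmap and read back out. The Gaussian target, the `sigma` and `threshold` values, and the peak-picking decoder are all assumptions chosen for simplicity; this page does not describe the paper's actual target construction or decoding step, which may differ.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=3.0):
    """Return a height x width grid that peaks at 1.0 on the ball centre (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def decode_heatmap(heatmap, threshold=0.5):
    """Return (x, y) of the strongest response, or None if no value exceeds threshold."""
    best_value, best_xy = threshold, None
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


# Encode a ball at pixel (320, 180) in a 640 by 360 frame, then decode it again.
target = gaussian_heatmap(640, 360, cx=320, cy=180)
print(decode_heatmap(target))
```

Returning `None` when nothing clears the threshold is how this sketch represents a frame in which the ball is judged to be absent or invisible.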

If this is right

  • Ball trajectories can be extracted automatically from any broadcast video, supporting large-scale performance and strategy studies without frame-by-frame manual labeling.
  • The network handles the full range of real-world difficulties including small size, blur, afterimage tracks, and temporary invisibility.
  • Performance on publicly available YouTube footage shows the method works on typical consumer-accessible sports recordings.
  • Direct comparison on identical data establishes that the learned heatmap approach exceeds conventional image-processing pipelines by a wide margin.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same heatmap-plus-multi-frame design could be retrained for other small fast objects such as shuttlecocks or table-tennis balls with only modest additional labeling.
  • Embedding TrackNet in a live pipeline would permit real-time trajectory overlays during broadcasts once inference speed is optimized.
  • Expanding the labeled set across varied lighting, court surfaces, and camera angles would likely raise the cross-validation recall above the current 75.7 percent.

Load-bearing premise

The human-provided ball position labels in the training and cross-validation videos are accurate and representative enough that the learned heatmap patterns generalize to new broadcast footage without systematic bias from labeling errors or domain shift.

What would settle it

Running the published TrackNet weights on a fresh set of tennis broadcast videos never seen during training or cross-validation and obtaining an F1-measure well below 84 percent would demonstrate that the reported accuracy does not hold in general.
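
A test of this kind needs a rule for turning per-frame predictions into precision and recall. The sketch below shows one plausible rule under stated assumptions: a prediction counts as correct when it lies within `tolerance_px` of the label, a mislocated prediction counts as a false positive, and frames with neither label nor prediction are ignored. The 5-pixel default and the function name are hypothetical and are not taken from this page.

```python
import math


def score_detections(predictions, labels, tolerance_px=5.0):
    """Compare per-frame (x, y) predictions with labels; None means 'no ball'.

    Returns (precision, recall). The counting rules are assumptions:
    mislocated detections are treated as false positives.
    """
    tp = fp = fn = 0
    for pred, truth in zip(predictions, labels):
        if pred is None and truth is None:
            continue  # true negative, not used by precision or recall
        if pred is None:
            fn += 1  # ball labeled but nothing detected
        elif truth is None:
            fp += 1  # detection in a frame with no labeled ball
        elif math.dist(pred, truth) <= tolerance_px:
            tp += 1  # detection close enough to the label
        else:
            fp += 1  # detection too far from the label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The resulting precision and recall can then be combined into an F1-measure and compared with the 84 percent cross-validation figure.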

Figures

Figures reproduced from arXiv: 1907.03698 by Ching-Hsuan Chen, I-No Liao, Tsì-Uí İk, Wen-Chih Peng, Yu-Chuan Huang.

Figure 1. Convolution operation in deep learning networks.
Figure 2. The ball image is hardly visible
Figure 4. An example of the prolonged tennis trace.
Figure 5. A hit case: (a) and (b) are labeled as flying, and (c) is labeled
Figure 7. An example of the prolonged badminton trace.
Figure 8. An example of the detection heatmap. TrackNet is composed of a convolutional neural network (CNN) followed by a deconvolutional neural network (DeconvNet) [7]. It takes consecutive frames to generate a heatmap indicating the position of the object. The number of input frames is a network parameter. One input frame is considered the conventional CNN network. TrackNet with more than one input frame can impr…
Figure 9. The architecture of the proposed TrackNet.
Figure 10. The loss curve of TrackNet model training.
Figure 11. The distribution of the positioning error.
Original abstract

Ball trajectory data are one of the most fundamental and useful information in the evaluation of players' performance and analysis of game strategies. Although vision-based object tracking techniques have been developed to analyze sport competition videos, it is still challenging to recognize and position a high-speed and tiny ball accurately. In this paper, we develop a deep learning network, called TrackNet, to track the tennis ball from broadcast videos in which the ball images are small, blurry, and sometimes with afterimage tracks or even invisible. The proposed heatmap-based deep learning network is trained to not only recognize the ball image from a single frame but also learn flying patterns from consecutive frames. TrackNet takes images with a size of $640\times360$ to generate a detection heatmap from either a single frame or several consecutive frames to position the ball and can achieve high precision even on public domain videos. The network is evaluated on the video of the men's singles final at the 2017 Summer Universiade, which is available on YouTube. The precision, recall, and F1-measure of TrackNet reach $99.7\%$, $97.3\%$, and $98.5\%$, respectively. To prevent overfitting, 9 additional videos are partially labeled together with a subset from the previous dataset to implement 10-fold cross-validation, and the precision, recall, and F1-measure are $95.3\%$, $75.7\%$, and $84.3\%$, respectively. A conventional image processing algorithm is also implemented to compare with TrackNet. Our experiments indicate that TrackNet outperforms conventional method by a big margin and achieves exceptional ball tracking performance. The dataset and demo video are available at https://nol.cs.nctu.edu.tw/ndo3je6av9/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TrackNet, a heatmap-based deep learning network for tracking high-speed and tiny tennis balls in broadcast videos. The network processes single frames or sequences of consecutive frames (640x360 resolution) to output detection heatmaps for ball positioning. It reports precision/recall/F1 of 99.7%/97.3%/98.5% on the 2017 Universiade men's singles final video and 95.3%/75.7%/84.3% under 10-fold cross-validation on 9 additional partially labeled videos plus a subset of the original dataset. It also reports outperforming a conventional image processing baseline.

Significance. If the results hold, TrackNet would offer a practical advance for sports video analytics by handling small, fast, blurry objects via learned multi-frame patterns, where conventional methods fail. The public dataset release and direct comparison to a baseline strengthen the contribution for reproducibility and benchmarking in computer vision for sports applications.

major comments (2)
  1. [Experiments / Dataset description] The reported metrics (abstract and Experiments section) are computed directly against human-provided ball position labels, but the manuscript supplies no details on the labeling process, inter-annotator agreement, pixel-error distribution, or verification protocol for the 'partially labeled' videos. For objects described as 'small, blurry, and sometimes with afterimage tracks or even invisible,' this is load-bearing: systematic annotation noise or bias would render both the single-video and cross-validation F1 scores unreliable.
  2. [Experiments] Table or results reporting the 10-fold CV (Experiments section): recall drops from 97.3% (single video) to 75.7% (CV), yet no analysis is provided of per-fold variance, domain shift between videos, or whether labeling inconsistencies contribute to the gap. This undermines the claim that the CV 'prevents overfitting' and supports generalization.
minor comments (1)
  1. [Abstract] The abstract states that 9 additional videos were 'partially labeled together with a subset from the previous dataset' but does not specify the number of frames labeled per video or the selection criteria for the subset.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments regarding dataset labeling and cross-validation analysis. We address each point below and commit to revisions that strengthen the manuscript without overstating what was originally done.

point-by-point responses
  1. Referee: [Experiments / Dataset description] The reported metrics (abstract and Experiments section) are computed directly against human-provided ball position labels, but the manuscript supplies no details on the labeling process, inter-annotator agreement, pixel-error distribution, or verification protocol for the 'partially labeled' videos. For objects described as 'small, blurry, and sometimes with afterimage tracks or even invisible,' this is load-bearing: systematic annotation noise or bias would render both the single-video and cross-validation F1 scores unreliable.

    Authors: We agree that additional details on the labeling process are necessary given the difficulty of the task. The ball positions were annotated manually by experienced annotators using standard video labeling software, with positions estimated from trajectory continuity when the ball was blurry or invisible. We will revise the Experiments section to describe this procedure, the tools used, and steps taken to verify labels on a subset of frames. Pixel-error distribution was not systematically recorded. Formal inter-annotator agreement was not computed because each video was labeled by one primary annotator for consistency; this limitation will be explicitly noted in the revision. revision: yes

  2. Referee: [Experiments] Table or results reporting the 10-fold CV (Experiments section): recall drops from 97.3% (single video) to 75.7% (CV), yet no analysis is provided of per-fold variance, domain shift between videos, or whether labeling inconsistencies contribute to the gap. This undermines the claim that the CV 'prevents overfitting' and supports generalization.

    Authors: The performance drop reflects greater video diversity (different courts, lighting, and camera setups) across the ten videos compared with the single final-match video. We will add per-fold metrics and variance statistics to the Experiments section, along with a short discussion of observed domain differences. We will also examine whether low-recall folds correlate with particular labeling challenges. This analysis was omitted from the original submission but can be included without new experiments. revision: yes
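
The per-fold statistics promised in this response could be computed along the following lines. This is only a sketch: the function names and the one-standard-deviation cutoff for flagging weak folds are assumptions, and no per-fold values from the paper are available on this page.

```python
import statistics


def summarize_folds(per_fold_values):
    """Return mean, sample standard deviation, min and max of a per-fold metric."""
    values = list(per_fold_values)
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def low_folds(per_fold_values, margin=1.0):
    """Return indices of folds more than `margin` standard deviations below the mean."""
    values = list(per_fold_values)
    stats = summarize_folds(values)
    cutoff = stats["mean"] - margin * stats["stdev"]
    return [index for index, value in enumerate(values) if value < cutoff]
```

Passing the ten per-fold recall values to `summarize_folds` would give the variance statistics, and `low_folds` would point to the folds worth checking for labeling difficulties or domain shift.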

standing simulated objections not resolved
  • Quantitative inter-annotator agreement or pixel-error distribution statistics, as these were not collected during the original labeling effort.

Circularity Check

0 steps flagged

No circularity: standard supervised evaluation on held-out data

full rationale

The paper trains a heatmap-based CNN on labeled frames (single or consecutive) to output ball position heatmaps and reports precision/recall/F1 on a held-out Universiade video plus 10-fold CV using additional partially labeled videos. No equations, fitted parameters, or self-citations are invoked such that any reported metric reduces by construction to a quantity defined from the training inputs themselves. Evaluation follows ordinary train/test separation with no self-definitional, fitted-input-renamed-as-prediction, or uniqueness-via-self-citation patterns.
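
For readers unfamiliar with the protocol, the following sketch shows the train/test separation that k-fold cross-validation provides: every item is held out exactly once and never appears in the training set of its own fold. Whether the paper splits by frame, by clip, or by video is not stated on this page, so `items` is deliberately generic, and the round-robin fold assignment is an assumption.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) lists so that each item is in exactly one test fold."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # Round-robin assignment: item i goes to fold i % k.
    folds = [items[i::k] for i in range(k)]
    for held_out, test in enumerate(folds):
        train = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train, test


# Stand-in identifiers; real items would be labeled frames or clips.
for train, test in k_fold_splits(list(range(20)), k=10):
    assert not set(train) & set(test)  # no overlap between train and test
```

If there are fewer items than folds, some test folds are empty, so in practice `k` should not exceed the number of items.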

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on supervised training of a convolutional network whose parameters are fitted to human-labeled ball positions; no new physical entities or mathematical axioms beyond standard deep-learning assumptions are introduced.

free parameters (1)
  • network weights
    All convolutional and deconvolutional layer parameters are fitted during training on the labeled video frames to minimize heatmap regression loss.
axioms (1)
  • domain assumption: Heatmap regression on image patches can localize small objects even when they are blurry or partially occluded
    The method assumes this property holds for tennis balls in broadcast footage so that the network output reliably indicates ball position.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

  [1] M. Archana and M. K. Geetha, “Object detection and tracking based on trajectory in broadcast tennis video,” Procedia Computer Science, vol. 58, pp. 225–232, 2015.

  [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

  [3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 23-28 June 2014, pp. 580–587.

  [4] R. Girshick, “Fast R-CNN,” in International Conference on Computer Vision (ICCV 2015), 11-18 December 2015, pp. 1440–1448.

  [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.

  [6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.

  [7] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.

  [8] H.-T. Chen, W.-J. Tsai, S.-Y. Lee, and J.-Y. Yu, “Ball tracking and 3D trajectory approximation with applications to tactics analysis from single-camera volleyball sequences,” Multimedia Tools and Applications, vol. 60, no. 3, pp. 641–667, October 2012.

  [9] X. Wang, V. Ablavsky, H. B. Shitrit, and P. Fua, “Take your eyes off the ball: Improving ball-tracking by focusing on team play,” Computer Vision and Image Understanding, vol. 119, pp. 102–115, February 2014.

  [10] T.-S. Fu, H.-T. Chen, C.-L. Chou, W.-J. Tsai, and S.-Y. Lee, “Screen-strategy analysis in broadcast basketball video using player tracking,” in Processing of the 2011 IEEE Visual Communications and Image (VCIP), 6-9 November 2011.

  [11] H. Myint, P. Wong, L. Dooley, and A. Hopgood, “Tracking a table tennis ball for umpiring purposes,” in Proceedings of the 14th IAPR International Conference on Machine Vision Applications (MVA 2015), 18-22 May 2015, pp. 170–173.

  [12] “Hawk-eye,” https://en.wikipedia.org/wiki/Hawk-Eye

  [13] X. Yu, C.-H. Sim, J. R. Wang, and L. F. Cheong, “A trajectory-based ball detection and tracking algorithm in broadcast tennis video,” in 2004 International Conference on Image Processing (ICIP 2004), vol. 2. Singapore: IEEE, 24-27 October 2004, pp. 1049–1052.

  [14] V. Renò, N. Mosca, M. Nitti, C. Guaragnella, T. D’Orazio, and E. Stella, “Real-time tracking of a tennis ball by combining 3d data and domain knowledge,” in Technology and Innovation in Sports, Health and Wellbeing (TISHW), International Conference on. IEEE, 2016, pp. 1–7.

  [15] F. Yan, W. Christmas, and J. Kittler, “A tennis ball tracking algorithm for automatic annotation of tennis match,” in Proceedings of the British Machine Vision Conference (BMVC 2005), vol. 2. Durham, England: BMVA, 5-8 September 2005, pp. 619–628.

  [16] X. Zhou, L. Xie, Q. Huang, S. J. Cox, and Y. Zhang, “Tennis ball tracking using a two-layered data association approach,” IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 145–156, 2015.

  [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

  [18] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.

  [19] W. Jiang and Z. Yin, “Human activity recognition using wearable sensors by deep convolutional neural networks,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015, pp. 1307–1310.

  [20] Y. Chen and Y. Xue, “A deep learning approach to human activity recognition based on single accelerometer,” in 2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2015, pp. 1488–1492.

  [21] V. Badrinarayanan, A. Handa, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.

  [22] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.

  [23] V. Belagiannis and A. Zisserman, “Recurrent human pose estimation,” in 2017 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017). IEEE, 2017, pp. 468–475.

  [24] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1913–1921.

  [25] “Hough gradient method,” https://goo.gl/gZTQRm

  [26] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” arXiv preprint, vol. abs/1212.5701, 2012. [Online]. Available: http://arxiv.org/abs/1212.5701