arxiv: 1907.09394 · v1 · pith:TDKJDLYOnew · submitted 2019-07-22 · 💻 cs.CV

Markerless Augmented Advertising for Sports Videos

Pith reviewed 2026-05-24 18:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords: markerless augmented reality · video augmentation · sports videos · homography tracking · 3D scene representation · advertisement placement · augmented advertising

The pith

An automated pipeline overlays advertisements in sports videos by building 3D scene models and applying homography tracking without markers or camera parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that markerless augmented advertising can be performed automatically in sports videos by identifying suitable textures, constructing a 3D representation of the scene, placing the ad within that model, projecting it back to each frame, and then tracking it across the clip. This process is designed to produce natural and perspective-correct results even under smooth camera motion or at shot boundaries. If the approach holds, ads could appear as part of the original footage rather than requiring separate commercial interruptions. A reader would care because the method removes the need for manual artist intervention or detailed camera calibration data during post-production.

Core claim

The paper claims that an automated video augmentation pipeline identifies textures of interest, builds a 3D representation of the scene, places the advertisement in 3D, projects it back onto the image plane, and uses homography-based shape-preserving tracking to achieve seamless and perspective-correct integration for the duration of a video clip, handling smooth camera motion and shot boundaries without camera intrinsics or markers.

What carries the argument

homography-based shape-preserving tracking applied after 3D advertisement placement and projection
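
To make the role of the homography concrete, here is a minimal sketch of mapping the four corners of a flat advertisement into a frame through a 3x3 homography. The matrix `H`, the corner coordinates, and the function name `apply_homography` are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def apply_homography(H, points):
    """Map 2D points through a 3x3 homography H given as nested lists."""
    mapped = []
    for x, y in points:
        u = H[0][0] * x + H[0][1] * y + H[0][2]
        v = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        # Divide by the homogeneous coordinate to get pixel positions.
        mapped.append((u / w, v / w))
    return mapped


# Hypothetical homography: a translation plus a small perspective term.
H = [[1.0, 0.0, 10.0],
     [0.0, 1.0, 20.0],
     [0.0, 0.001, 1.0]]

# Corners of a 100 x 50 advertisement in its own coordinate frame.
ad_corners = [(0, 0), (100, 0), (100, 50), (0, 50)]
frame_corners = apply_homography(H, ad_corners)
```

Because of the perspective term in the last row, the bottom edge of the mapped quadrilateral comes out shorter than the top edge, which is the kind of foreshortening a perspective-correct overlay has to reproduce.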

If this is right

  • The advertisement remains aligned and natural-looking throughout the clip.
  • No skilled artist or advanced post-production editing tools are required.
  • Placement succeeds without knowledge of camera intrinsic parameters.
  • The system supports continuous viewing without separate commercial breaks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline structure could be tested on non-sports video with comparable camera motion patterns.
  • Integration costs for advertising in broadcast content might decrease if tracking proves reliable.
  • Extensions could explore handling of lighting changes or partial occlusions not addressed in the current clips.

Load-bearing premise

Homography tracking can maintain perspective-correct placement across video clips with smooth camera motion and shot boundaries even without camera intrinsics or markers.
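
One common way to obtain the per-frame homography this premise depends on is to fit it to four tracked point correspondences. The sketch below does that with a direct linear solve, fixing the bottom-right entry of the matrix to 1. It is an assumption about how such a step might look, not the paper's implementation, and the names `solve_linear` and `homography_from_4_points` are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def homography_from_4_points(src, dst):
    """Fit H (with H[2][2] = 1) so that each src point maps to its dst point."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, eight unknowns in total.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


# Hypothetical corner positions in two consecutive frames (pure translation).
prev_corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
next_corners = [(10, 20), (11, 20), (11, 21), (10, 21)]
H = homography_from_4_points(prev_corners, next_corners)
```

A four-point fit has no redundancy, so a single badly tracked corner corrupts the result. Practical systems usually fit to many features with a robust estimator, which is outside the scope of this sketch.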

What would settle it

A sports video sequence with abrupt camera movement or multiple shot changes in which the overlaid advertisement distorts or drifts from its intended surface position.
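
A test of this kind needs a number for "drift". One simple candidate, sketched below under the assumption that reference corner positions are available (for example from manual annotation), is the mean pixel distance between tracked and reference corners per frame. The function name, the sample coordinates, and the 5-pixel tolerance are hypothetical; the page reports no such metric for the paper.

```python
import math


def mean_corner_drift(tracked, reference):
    """Mean Euclidean distance in pixels between matching corners."""
    distances = [
        math.hypot(tx - rx, ty - ry)
        for (tx, ty), (rx, ry) in zip(tracked, reference)
    ]
    return sum(distances) / len(distances)


# Hypothetical annotated corners and tracker output for one frame.
reference = [(100, 100), (200, 100), (200, 150), (100, 150)]
tracked = [(103, 104), (203, 104), (203, 154), (103, 154)]

drift = mean_corner_drift(tracked, reference)
exceeds_tolerance = drift > 5.0  # the 5-pixel tolerance is an arbitrary choice
```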

Figures

Figures reproduced from arXiv: 1907.09394 by Cambron Carter, Divyaa Ravichandran, Emmanuel Antonio Cuevas, Hallee E. Wong, Iris Fu, Iuliana Tabian, Osman Akar.

Figure 1: Our proposed automated pipeline for augmentation.
Figure 2: The input image of a baseball game is segmented by PSPNet’s ADE20K …
Figure 3: Segmented images and the SQS associated with the quality of the seg…
Figure 4: An example of a crowd image and an inverse depth map visualization of …
Figure 5: Perspective correct asset placement with “unnatural” and “natural” ori…
Figure 6: Illustration of asset placement procedure using Fig. …
Figure 7: Features (red points) to be tracked are identified within a 50 px radius (yellow circles) of the corners (large green points) of the quadrilateral
Figure 8: Example of the pipeline’s intermediate outputs running on a single image.
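
The Figure 7 caption describes selecting features within a 50 px radius of each quadrilateral corner. A minimal sketch of that selection step follows, assuming feature locations have already been detected by some other means. The function name and the sample coordinates are hypothetical.

```python
import math


def features_near_corners(features, corners, radius=50.0):
    """Group feature points by the corner they lie within `radius` pixels of."""
    grouped = {}
    for index, (cx, cy) in enumerate(corners):
        grouped[index] = [
            (fx, fy)
            for fx, fy in features
            if math.hypot(fx - cx, fy - cy) <= radius
        ]
    return grouped


# Hypothetical detected features and two corners of the ad quadrilateral.
features = [(10, 10), (60, 0), (130, 40), (300, 300)]
corners = [(0, 0), (100, 0)]
nearby = features_near_corners(features, corners)
```

Restricting features to the neighbourhood of each corner keeps the tracked points close to the surface that carries the advertisement, at the cost of having fewer points to work with when a corner falls in a low-texture region.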
Original abstract

Markerless augmented reality can be a challenging computer vision task, especially in live broadcast settings and in the absence of information related to the video capture such as the intrinsic camera parameters. This typically requires the assistance of a skilled artist, along with the use of advanced video editing tools in a post-production environment. We present an automated video augmentation pipeline that identifies textures of interest and overlays an advertisement onto these regions. We constrain the advertisement to be placed in a way that is aesthetic and natural. The aim is to augment the scene such that there is no longer a need for commercial breaks. In order to achieve seamless integration of the advertisement with the original video we build a 3D representation of the scene, place the advertisement in 3D, and then project it back onto the image plane. After successful placement in a single frame, we use homography-based, shape-preserving tracking such that the advertisement appears perspective correct for the duration of a video clip. The tracker is designed to handle smooth camera motion and shot boundaries.
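
The "place in 3D, then project back" step can be sketched with a basic pinhole camera model. Since the abstract says intrinsic parameters are unavailable, the focal length and principal point below are assumed values, and the 3D quadrilateral is a hypothetical placement. This page does not describe the camera model the paper actually uses.

```python
def project_points(points_3d, focal_px, cx, cy):
    """Project camera-frame 3D points onto the image plane (pinhole model)."""
    projected = []
    for X, Y, Z in points_3d:
        if Z <= 0:
            raise ValueError("point is behind the camera")
        projected.append((focal_px * X / Z + cx, focal_px * Y / Z + cy))
    return projected


# Hypothetical ad quadrilateral on a surface that recedes to the right:
# the left edge sits at depth 4, the right edge at depth 6.
ad_quad_3d = [
    (-1.0, -0.5, 4.0),
    (1.0, -0.5, 6.0),
    (1.0, 0.5, 6.0),
    (-1.0, 0.5, 4.0),
]

# Assumed intrinsics for a 640 x 480 frame.
corners_2d = project_points(ad_quad_3d, focal_px=600.0, cx=320.0, cy=240.0)
```

In this example the nearer left edge projects to a taller segment than the farther right edge, so the advertisement appears to recede with the surface rather than sitting flat on the screen.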

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper describes an automated pipeline for markerless augmented advertising in sports videos. It identifies textures of interest, builds a 3D representation of the scene, places the advertisement in 3D, projects it back onto the image plane, and uses homography-based, shape-preserving tracking to maintain perspective-correct placement across frames while handling smooth camera motion and shot boundaries, without requiring camera intrinsics or markers. The goal is seamless integration to eliminate the need for commercial breaks.

Significance. If the described pipeline achieves reliable aesthetic placement and seamless tracking, it could have practical impact on live sports broadcasting by enabling non-intrusive ad augmentation. The approach targets a real-world challenge in markerless AR under unconstrained capture conditions. However, the complete absence of quantitative results, error metrics, or validation experiments makes it impossible to assess whether the claims hold or how the system performs relative to existing methods.

major comments (2)
  1. [Abstract] The pipeline is described at a high level but the manuscript provides no quantitative results, error metrics, validation experiments, or implementation details to support the claims of aesthetic/natural placement or seamless integration.
  2. [Abstract] The central assumption that homography-based tracking maintains perspective-correct placement across clips despite smooth camera motion and shot boundaries (without camera intrinsics) is stated without evidence, discussion of failure modes (e.g., non-planar surfaces or depth variation), or any supporting experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript describing the markerless augmented advertising pipeline. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The pipeline is described at a high level but the manuscript provides no quantitative results, error metrics, validation experiments, or implementation details to support the claims of aesthetic/natural placement or seamless integration.

    Authors: We acknowledge that the manuscript presents the pipeline at a conceptual level. The work emphasizes the overall architecture for texture identification, 3D placement, and homography tracking in unconstrained sports video without requiring camera intrinsics or markers. To address this, the revised manuscript will incorporate additional implementation details and qualitative results from example sequences demonstrating aesthetic placement and tracking across frames. Quantitative error metrics are not included in the original submission as the focus is on system design rather than comparative benchmarking; we will discuss potential evaluation strategies in the revision.
    Revision: partial

  2. Referee: [Abstract] The central assumption that homography-based tracking maintains perspective-correct placement across clips despite smooth camera motion and shot boundaries (without camera intrinsics) is stated without evidence, discussion of failure modes (e.g., non-planar surfaces or depth variation), or any supporting experiments.

    Authors: The homography tracking is applied under the assumption that the region of interest (e.g., sports field) can be treated as approximately planar, which holds for many broadcast sports scenarios. We will revise the manuscript to include an explicit discussion of this assumption, potential failure cases such as non-planar surfaces or large depth variations, and the method for detecting and handling shot boundaries during tracking. This will provide a more balanced analysis of the approach's scope and limitations.
    Revision: yes
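
The rebuttal mentions detecting shot boundaries but does not say how. As one illustrative possibility, and not the authors' stated method, a boundary can be flagged when the grayscale histogram changes sharply between consecutive frames. Frames are represented here as nested lists of 0-255 intensities, and the bin count and threshold are arbitrary assumptions.

```python
def gray_histogram(frame, bins=16):
    """Normalized intensity histogram of a frame of 0-255 values."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[min(value * bins // 256, bins - 1)] += 1
    total = sum(hist)
    return [count / total for count in hist]


def histogram_distance(h1, h2):
    """Half the L1 distance, so the result lies between 0 and 1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames, threshold=0.5):
    """Return indices of frames that start a new shot."""
    boundaries = []
    previous = gray_histogram(frames[0])
    for index in range(1, len(frames)):
        current = gray_histogram(frames[index])
        if histogram_distance(previous, current) > threshold:
            boundaries.append(index)
        previous = current
    return boundaries


# Hypothetical 4 x 4 frames: two dark frames followed by two bright ones.
dark = [[10] * 4 for _ in range(4)]
bright = [[240] * 4 for _ in range(4)]
cuts = detect_shot_boundaries([dark, dark, bright, bright])
```

A tracker could use such a signal to stop propagating the homography across a cut and re-run placement on the new shot. Gradual transitions such as fades would need a different test.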

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents a high-level system description of an automated video augmentation pipeline for markerless AR advertising in sports videos. It covers texture identification, 3D scene building, ad placement in 3D, projection to image plane, and homography-based tracking, but contains no equations, derivations, fitted parameters, predictions, or first-principles results. No self-citations, uniqueness theorems, or ansatzes are invoked in any load-bearing mathematical sense. The work is a descriptive pipeline architecture with no derivation chain that could reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no explicit free parameters, invented entities, or ad-hoc axioms are stated. The approach relies on standard domain assumptions in computer vision about scene geometry and tracking.

axioms (1)
  • Domain assumption: Homography-based tracking suffices to handle smooth camera motion and shot boundaries while preserving shape and perspective.
    Invoked in the abstract as the method for maintaining placement across frames.

pith-pipeline@v0.9.0 · 5723 in / 1207 out tokens · 38288 ms · 2026-05-24T18:10:41.380721+00:00 · methodology
