
arxiv: 1907.01058 · v1 · pith:Q4A3O64Cnew · submitted 2019-07-01 · 💻 cs.CV

Associative Embedding for Game-Agnostic Team Discrimination

Pith reviewed 2026-05-25 11:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords team discrimination · associative embedding · player segmentation · sports analytics · CNN · generalization · pixel embedding · basketball

The pith

A CNN produces pixel embeddings that group same-team players across entirely new games without retraining or appearance modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a lightweight segmentation network can be trained once to output pixel-wise embedding vectors that are similar for all players on one team and dissimilar for the other team. Because the embeddings are learned to be identical for unconnected pixels belonging to different players of the same team, team labels can be read off immediately at the start of any new game. The approach adapts the associative embedding idea to the team-discrimination setting and is tested on panoramic basketball footage containing occlusions and interactions. If the claim holds, sport-analytics pipelines can add team separation as a plug-in module rather than training a new classifier per arena or per season.

Core claim

The central claim is that associative embeddings derived from a segmentation network can assign the same embedding vector to pixels of distinct players who belong to the same team, enabling accurate team discrimination on unseen games and arenas without any per-game fine-tuning or explicit appearance modeling.

What carries the argument

Pixel-wise embedding vectors produced by a lightweight segmentation CNN that are forced to be identical for all pixels of one team and different for the opposing team.
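
A minimal sketch of what such an objective could look like, assuming a pull/push formulation in the spirit of the associative embedding framework the paper cites (arXiv:1611.05424). The Figure 1 caption names an Lpush term, but the exact loss is not given on this page, so the function names, the margin value, and the toy 2-D embeddings below are illustrative assumptions rather than the authors' formulation.

```python
from statistics import fmean


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def team_means(embeddings, teams):
    """Average the embedding vectors of all pixels that share a team label."""
    groups = {}
    for vec, team in zip(embeddings, teams):
        groups.setdefault(team, []).append(vec)
    return {t: [fmean(c) for c in zip(*vecs)] for t, vecs in groups.items()}


def pull_loss(embeddings, teams):
    """Mean squared distance of each pixel embedding to its own team mean."""
    means = team_means(embeddings, teams)
    return fmean(sq_dist(v, means[t]) for v, t in zip(embeddings, teams))


def push_loss(embeddings, teams, margin=1.0):
    """Hinge penalty whenever two team means lie closer than `margin` (assumed value)."""
    means = list(team_means(embeddings, teams).values())
    pairs = [(a, b) for i, a in enumerate(means) for b in means[i + 1:]]
    if not pairs:
        return 0.0
    return fmean(max(0.0, margin - sq_dist(a, b) ** 0.5) ** 2 for a, b in pairs)


# Toy data: four player pixels, two per team, with hypothetical 2-D embeddings.
pixels = [[0.0, 0.0], [0.2, 0.0], [2.0, 2.0], [2.0, 2.2]]
labels = ["A", "A", "B", "B"]
print(pull_loss(pixels, labels), push_loss(pixels, labels))
```

The pull term does not look at pixel connectivity at all, which is what lets pixels of different players on the same team share one target.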

If this is right

  • Team labels become available from the first frame of any new game.
  • The same trained network can be dropped into multiple sport-analytics pipelines without additional learning.
  • Occlusions and player interactions do not break the embedding-based separation on the tested panoramic basketball views.
  • No hand-crafted color or appearance features are required once the network is trained.
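
How team labels are read off the embeddings is not described on this page. One plausible way, sketched below, is to average the embedding over each detected player and split the resulting vectors into two groups with a plain 2-means iteration. The clustering step, the deterministic initialisation, and the name `two_team_labels` are assumptions made for illustration, not the authors' procedure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]


def two_team_labels(player_embeddings, iterations=10):
    """Assign each player to cluster 0 or 1 using a simple 2-means loop."""
    first = player_embeddings[0]
    # Deterministic start: the first player and the player farthest from it.
    far = max(player_embeddings, key=lambda v: sq_dist(v, first))
    centroids = [first, far]
    labels = []
    for _ in range(iterations):
        labels = [
            0 if sq_dist(v, centroids[0]) <= sq_dist(v, centroids[1]) else 1
            for v in player_embeddings
        ]
        for k in (0, 1):
            members = [v for v, lab in zip(player_embeddings, labels) if lab == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels


# Hypothetical per-player mean embeddings from a single frame.
players = [[0.1, 0.0], [1.9, 2.0], [0.0, 0.2], [2.1, 2.1]]
print(two_team_labels(players))  # → [0, 1, 0, 1]
```

The cluster ids carry no team identity of their own; they only say which players belong together.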

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding principle could be tested on other team sports that share similar player-interaction patterns.
  • If the embeddings remain stable across camera angles, they might also support tracking consistency without explicit re-identification modules.
  • Extending the loss to enforce embedding constancy across short time windows could reduce label flicker in video sequences.
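
The last suggestion can be made concrete with a small sketch. It assumes each player is already tracked across frames and that one embedding vector per frame is available for that player; the penalty below is a hypothetical extra loss term, not something the paper proposes.

```python
def temporal_constancy_penalty(track):
    """Mean squared change of one player's embedding between consecutive frames.

    `track` is a list of embedding vectors, one per frame, for a single player.
    """
    if len(track) < 2:
        return 0.0
    diffs = [
        sum((a - b) ** 2 for a, b in zip(prev, cur))
        for prev, cur in zip(track, track[1:])
    ]
    return sum(diffs) / len(diffs)


# A player whose embedding jumps in the last frame is penalised.
print(temporal_constancy_penalty([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]))  # → 0.5
```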

Load-bearing premise

Training footage from a variety of basketball games already contains enough visual diversity that the learned embeddings transfer directly to entirely new arenas and camera setups.

What would settle it

Measure team-label accuracy on a held-out set of games filmed in a previously unseen arena with different lighting, court markings, and uniform styles; if accuracy there falls clearly below the level reported for games drawn from the training distribution, the generalization claim fails.
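
When scoring such a test, note that embedding clusters carry no fixed team identity, so predicted ids may be swapped relative to the reference. A minimal sketch of a swap-invariant accuracy for the two-team case follows; the metric and the name `team_accuracy` are assumptions for illustration and are not taken from the paper's evaluation protocol.

```python
def team_accuracy(predicted, reference):
    """Fraction of players labelled correctly, up to swapping the two cluster ids.

    Both arguments are equal-length lists of binary labels (0 or 1).
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("label lists must be non-empty and of equal length")
    agree = sum(p == r for p, r in zip(predicted, reference)) / len(predicted)
    # With two teams, swapping the ids turns agreement `a` into `1 - a`.
    return max(agree, 1.0 - agree)


print(team_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
print(team_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # → 0.75
```

Because of the swap, the floor of this metric is 0.5 rather than 0, which matters when comparing results between arenas.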

Figures

Figures reproduced from arXiv: 1907.01058 by Christophe De Vleeschouwer, Julien Moreau, Maxime Istasse.

Figure 1: Overview of our architecture. ICNet [38] is used as backbone for the following assets: pixel-wise segmentation, combination of three scales to encode global and local features, fast ([38] reaches 30 FPS at 1024 × 2048 resolution). Its last convolution is modified to output a segmentation mask along with vector embeddings in each pixel. We keep the multi-scale supervision for the segmentation and add Lpush and …
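
The modified last convolution in the caption can be pictured with a short sketch. It assumes a 1x1 convolution that emits, for every pixel, one mask logit followed by D embedding channels; this channel layout, the toy weights, and the function names are hypothetical and are not taken from the paper.

```python
def head_1x1(features, weights, bias):
    """Apply a 1x1 convolution at one pixel: an affine map over input channels."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def split_head_output(channels):
    """Split per-pixel outputs into (mask_logit, embedding); layout is assumed."""
    return channels[0], channels[1:]


# Two input feature channels mapped to 1 mask logit + 2 embedding channels.
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]

mask_logit, embedding = split_head_output(head_1x1(features, weights, bias))
is_player = mask_logit > 0.0  # embeddings matter only where the mask fires
print(is_player, embedding)  # → True [2.0, 3.5]
```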
Figure 2: Team discrimination with associative embedding. From left to right: test image, zoomed reference masks and …
Abstract

Assigning team labels to players in a sport game is not a trivial task when no prior is known about the visual appearance of each team. Our work builds on a Convolutional Neural Network (CNN) to learn a descriptor, namely a pixel-wise embedding vector, that is similar for pixels depicting players from the same team, and dissimilar when pixels correspond to distinct teams. The advantage of this idea is that no per-game learning is needed, allowing efficient team discrimination as soon as the game starts. In principle, the approach follows the associative embedding framework introduced in arXiv:1611.05424 to differentiate instances of objects. Our work is however different in that it derives the embeddings from a lightweight segmentation network and, more fundamentally, because it considers the assignment of the same embedding to unconnected pixels, as required by pixels of distinct players from the same team. Excellent results, both in terms of team labelling accuracy and generalization to new games/arenas, have been achieved on panoramic views of a large variety of basketball games involving player interactions and occlusions. This makes our method a good candidate to integrate team separation in many CNN-based sport analytics pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes adapting the associative embedding framework to learn pixel-wise descriptors from a lightweight segmentation CNN for assigning team labels to players in basketball games. The embeddings are trained to be similar for pixels of players from the same team (even if unconnected) and dissimilar for different teams, with the goal of enabling immediate, game-agnostic team discrimination without per-game fine-tuning, appearance models, or prior knowledge of team visuals. The abstract asserts excellent results on team labelling accuracy and generalization to new games/arenas using panoramic views of diverse basketball games involving interactions and occlusions.

Significance. If the quantitative claims hold with appropriate controls, the approach could provide a practical component, needing no per-game training, for sport analytics pipelines that rely on player tracking and team separation. The adaptation of associative embeddings to group unconnected same-team pixels is a direct and reasonable extension of the cited prior work. However, the absence of any reported metrics, baselines, dataset sizes, or protocol details prevents assessment of whether the generalization actually factors out game-specific appearance or merely exploits shared visual patterns in the training distribution.

major comments (1)
  1. [Abstract] The central claim of 'excellent results, both in terms of team labelling accuracy and generalization to new games/arenas' is asserted without any quantitative metrics, baselines, error bars, dataset sizes, train/test split details, or experimental protocol. This directly undermines verification of the load-bearing generalization assertion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment. We address the concern regarding the abstract below.

point-by-point responses
  1. Referee: [Abstract] The central claim of 'excellent results, both in terms of team labelling accuracy and generalization to new games/arenas' is asserted without any quantitative metrics, baselines, error bars, dataset sizes, train/test split details, or experimental protocol. This directly undermines verification of the load-bearing generalization assertion.

    Authors: We agree that the abstract makes a qualitative claim without supporting numbers, which limits immediate verifiability. The body of the manuscript reports the experimental results on team labelling accuracy and generalization across games. In the revised version we will update the abstract to include the key quantitative figures (accuracy on held-out games, number of training and test games, and a concise protocol summary) so that the generalization claim can be assessed directly from the abstract.

    revision: yes

Circularity Check

0 steps flagged

No circularity: adapts external associative embedding framework to team discrimination task with independent empirical validation.

full rationale

The paper's core method follows the associative embedding framework from the external citation arXiv:1611.05424 (Newell et al.), with explicit modifications including use of a lightweight segmentation network and assignment of embeddings to unconnected pixels for same-team players. No equations, loss functions, or claims in the provided text reduce by construction to fitted parameters or results defined in the authors' own prior work. Generalization to new games is asserted via empirical results on held-out basketball data rather than by definitional equivalence or self-citation chains. The cited prior work is independent (no author overlap indicated), and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; the method appears to rely on standard supervised CNN training whose specifics are not described.



Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

  1. [1]

    A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst. Sparsity driven people localization with a heterogeneous network of cameras. Journal of Mathematical Imaging and Vision, 41(1):39–58, Sep 2011. 7

  2. [2]

    V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, Dec 2017. 2

  3. [3]

    A. Bialkowski, P. Lucey, P. Carr, S. Sridharan, and I. Matthews. Representing Team Behaviours from Noisy Data Using Player Role, pages 247–269. Springer International Publishing, Cham, 2014. 1

  4. [4]

    D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Anal., 1(1):121–143, 03

  5. [5]

    P. Carr, Y. Sheikh, and I. Matthews. Monocular object detection using 3d geometric primitives. In Computer Vision – ECCV 2012, pages 864–878, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. 7

  6. [6]

    J. Chen, H. M. Le, P. Carr, Y. Yue, and J. J. Little. Learning online smooth predictors for realtime camera planning using recurrent decision trees. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 1

  7. [7]

    A. Cioppa, A. Deliege, and M. Van Droogenbroeck. A bottom-up approach based on semantics for the interpretation of the main camera stream in soccer games. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018. 1, 2, 6

  8. [8]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 5

  9. [9]

    D. Delannay, N. Danhier, and C. De Vleeschouwer. Detection and recognition of sports(wo)men from multiple views. In 2009 Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), pages 1–7, Aug 2009. 7

  10. [10]

    T. D’Orazio, M. Leo, P. Spagnolo, P. L. Mazzeo, N. Mosca, M. Nitti, and A. Distante. An investigation into the feasibility of real-time soccer offside detection from a multiple camera system. IEEE Transactions on Circuits and Systems for Video Technology, 19(12):1804–1818, 2009. 1, 6

  11. [11]

    S. Gerke, A. Linnemann, and K. Müller. Soccer player recognition using spatial constellation features and jersey number recognition. Computer Vision and Image Understanding, 159:105–115, 2017. Computer Vision in Sports. 2

  12. [12]

    J. Hobbs, P. Power, L. Sha, and P. Lucey. Quantifying the value of transitions in soccer via spatiotemporal trajectory clustering. In MIT Sloan Sports Analytics Conference, 2018. 1

  13. [13]

    A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 7

  14. [14]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015. 4

  15. [15]

    H. Law and J. Deng. CornerNet: detecting objects as paired keypoints. In The European Conference on Computer Vision (ECCV), September 2018. 3, 4, 7

  16. [16]

    J. Liu and P. Carr. Detecting and Tracking Sports Players with Random Forests and Context-Conditioned Motion Models, pages 113–132. Springer International Publishing, Cham, 2014. 1

  17. [17]

    K. Lu, J. Chen, J. J. Little, and H. He. Lightweight convolutional neural networks for player detection and classification. Computer Vision and Image Understanding, 172:77–87, 2018. 1, 2, 7

  18. [18]

    W.-L. Lu, J.-A. Ting, J. J. Little, and K. P. Murphy. Learning to track and identify players from broadcast sports videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1704–1716, 2013. 1, 6

  19. [19]

    M. Manafifard, H. Ebadi, and H. Abrishami Moghaddam. A survey on player tracking in soccer videos. Computer Vision and Image Understanding, 159:19–46, 2017. Computer Vision in Sports. 1

  20. [20]

    D. Mazzini. Guided upsampling network for real-time semantic segmentation. In The British Machine Vision Conference (BMVC), September 2018. 2

  21. [21]

    A. Newell and J. Deng. Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems 30, pages 2171–2180. Curran Associates, Inc., 2017. 2, 3, 4, 7

  22. [22]

    A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. CoRR, abs/1611.05424, 2017. 1, 2, 3, 4, 7

  23. [23]

    P. Parisot and C. De Vleeschouwer. Consensus-based trajectory estimation for ball detection in calibrated cameras systems. Journal of Real-Time Image Processing, Sep 2016. 1

  24. [24]

    P. Parisot and C. De Vleeschouwer. Scene-specific classifier for effective and efficient team sport players detection from a single calibrated camera. Computer Vision and Image Understanding, 159:74–88, 2017. Computer Vision in Sports. 1, 7

  25. [25]

    A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. CoRR, abs/1606.02147, 2016. 2

  26. [26]

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. 6

  27. [27]

    R. P. K. Poudel, U. Bonde, S. Liwicki, and C. Zach. ContextNet: exploring context and detail for semantic segmentation in real-time. In The British Machine Vision Conference (BMVC), September 2018. 2

  28. [28]

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, Dec 2015. 2

  29. [29]

    M. P. Shah. Semantic segmentation architectures implemented in pytorch. https://github.com/meetshah1995/pytorch-semseg, 2017. 4

  30. [30]

    G. Thomas, R. Gade, T. B. Moeslund, P. Carr, and A. Hilton. Computer vision for sports: Current applications and research topics. Computer Vision and Image Understanding, 159:3–18, 2017. Computer Vision in Sports. 1

  31. [31]

    X. Tong, J. Liu, T. Wang, and Y. Zhang. Automatic player labeling, tracking and field registration and trajectory mapping in broadcast soccer video. ACM Trans. Intell. Syst. Technol., 2(2):15:1–15:32, Feb. 2011. 1, 6

  32. [32]

    C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy. Tracking emerges by colorizing videos. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 402–419, Cham, Springer International Publishing. 3

  33. [33]

    X. Wei, L. Sha, P. Lucey, P. Carr, S. Sridharan, and I. Matthews. Predicting ball ownership in basketball from a monocular view using only player trajectories. In The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015. 1

  34. [34]

    Z. Wu, C. Shen, and A. van den Hengel. Real-time semantic image segmentation via spatial sparsity. CoRR, abs/1712.00213, 2017. 2

  35. [35]

    F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June

  36. [36]

    C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. BiSeNet: bilateral segmentation network for real-time semantic segmentation. In The European Conference on Computer Vision (ECCV), September 2018. 2, 4

  37. [37]

    F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. 2

  38. [38]

    H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for real-time semantic segmentation on high-resolution images. In The European Conference on Computer Vision (ECCV), September 2018. 2, 3, 4, 5, 7

  39. [39]

    H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 2, 4

  40. [40]

    S. Zheng, Y. Yue, and J. Hobbs. Generating long-term trajectories using deep hierarchical networks. In Advances in Neural Information Processing Systems, pages 1543–1551,