arxiv: 2605.10796 · v1 · submitted 2026-05-11 · 💻 cs.AI

Interpretable Machine Learning for Football Performance Analysis: Evidence of Limited Transferability from Elite Leagues to University Competition

Pith reviewed 2026-05-12 04:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords interpretable machine learning, football performance analysis, domain shift, SHAP explanations, counterfactual impact score, transferability, elite vs university competition

The pith

Machine learning interpretations of football performance do not transfer reliably from elite leagues to university competition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether performance determinants identified through machine learning in top European football leagues produce consistent interpretations when models are applied to university matches. It trains Random Forest and Multilayer Perceptron models on large-scale event data from five elite leagues and interprets them with SHAP and Counterfactual Impact Score methods. When the same feature space and models are used on National Tsing Hua University football data, key indicators reorder substantially, explanation stability drops, structural agreement with elite results weakens, and outcomes become more sensitive to the explanation technique chosen. These patterns indicate that interpretability robustness varies by competition level rather than being a fixed property of the modeling approach.

Core claim

Elite football exhibits a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. NTHU university football, by contrast, shows substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method. This suggests that interpretability robustness is domain-dependent and that instability under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.

What carries the argument

Stability of feature importance hierarchies and structural agreement measured by applying SHAP and Counterfactual Impact Score explanations to Random Forest and Multilayer Perceptron models trained on identical event-data features from elite versus university domains.
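
Structural agreement between two importance rankings can be sketched with a Spearman rank correlation, the measure named in the Figure 5 and Figure 6 captions. The snippet below is a minimal illustration and not the authors' code: the feature names other than Touches and all importance values are made-up placeholders, and the helper names `average_ranks` and `spearman` are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Placeholder importance vectors (NOT values from the paper), one entry per feature.
features = ["Shots", "Passes", "Touches", "Tackles", "Crosses"]
elite_a = [0.30, 0.25, 0.10, 0.20, 0.15]  # hypothetical elite league A
elite_b = [0.32, 0.22, 0.12, 0.21, 0.13]  # hypothetical elite league B
target = [0.24, 0.12, 0.28, 0.16, 0.20]   # hypothetical university domain

print("elite A vs elite B:", spearman(elite_a, elite_b))
print("elite A vs target: ", spearman(elite_a, target))
```

A correlation near 1 means two domains order the features the same way, while values near 0 or below indicate reordering. The sketch assumes each vector contains at least two distinct values; otherwise the denominator is zero.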

If this is right

  • Interpretability techniques require domain-specific validation rather than assuming transferability across competition levels in football.
  • Instability in explanations can serve as a diagnostic signal of structural ambiguity or differences in the target domain.
  • Performance analysis at lower competition levels may need models trained or adapted on domain-matched data to maintain reliable interpretations.
  • Agreement between different explanation methods is higher within elite data than across the elite-to-university shift.
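
One simple way to look at agreement between two explanation methods is the overlap of their top-k feature sets. This is a swapped-in illustration, not a metric the paper is stated to use, and the scores below are placeholders rather than reported SHAP or CIS values; `top_k` and `jaccard` are hypothetical helper names.

```python
def top_k(importances, k):
    """Return the names of the k features with the largest importance."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return set(ordered[:k])

def jaccard(a, b):
    """Jaccard overlap of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Placeholder scores for one model in one domain (NOT values from the paper).
shap_scores = {"Shots": 0.30, "Passes": 0.25, "Tackles": 0.20,
               "Crosses": 0.15, "Touches": 0.10}
cis_scores = {"Shots": 0.28, "Touches": 0.26, "Passes": 0.20,
              "Tackles": 0.14, "Crosses": 0.12}

agreement = jaccard(top_k(shap_scores, 3), top_k(cis_scores, 3))
print("top-3 overlap between methods:", agreement)
```

Computing the same overlap separately for each domain would show whether the two methods agree more on elite data than on university data, which is the pattern the bullet above describes.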

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • University teams or analysts might benefit more from building separate models on their own data than from transferring elite-derived explanations.
  • Similar reordering and instability could appear in other team sports when moving between professional and amateur or youth levels.
  • Collecting richer contextual data at the university level could test whether the current instability is an artifact of limited features or a real difference in performance structure.

Load-bearing premise

The same set of features extracted from event data captures comparable underlying performance determinants in both elite leagues and university football, so that differences in explanation stability reflect genuine domain shift rather than differences in data quality, match context, or unmeasured variables.

What would settle it

If re-running the full pipeline on university data that has been augmented with higher-resolution tracking or additional context variables produces explanation stability and hierarchy agreement comparable to the elite results, the claim that domain shift drives the observed instability would be weakened.

Figures

Figures reproduced from arXiv: 2605.10796 by Chien-Ming Hsu, Chun-Yi Wang, Kok-Hua Tan, Sheng-Chieh Huang, You-Ying Ji, Yu-Fang Tsai, Yu-Jen Chen, Yu-Lun Chen.

Figure 1: Overall framework and study design. Models are trained exclusively on elite football data and applied to university football for inference only. This design isolates domain shift effects and enables evaluation of whether learned performance structures and their explanations remain stable across competition levels. For provenance control, elite data were collected during a single scripted extraction window …

Figure 2: Statistical range of NTHU university football event features across the 17 matches. The figure provides a compact view of feature-level dispersion in the target domain. To characterize target-domain variability, …

Figure 3: Global feature importance comparison between elite football and NTHU university football using RF. SHAP and CIS provide complementary importance estimates; feature rankings differ across domains despite identical feature definitions. […] and several indicators shifting relative positions. For example, under CIS, Touches rises from rank 5 (elite) to rank 2 (NTHU). This shift suggests that the RF explanation ass…

Figure 4: Global feature importance comparison between elite football and NTHU university football using MLP. SHAP and CIS reveal domain-dependent reordering of important indicators under identical feature space.

Figure 5: Seed-wise reproducibility of SHAP feature importance rankings measured by pairwise Spearman rank correlation. Distributions are shown separately for elite football and NTHU university football across multiple random seeds. Experiment 3 evaluates the reproducibility of feature importance rankings under training randomness. Following Experiment 2, this analysis examines whether explanation structures remain…

Figure 6: SHAP-based structural agreement of feature importance rankings across domains. Each cell represents the Spearman rank correlation between a pair of domains. Strong agreement is observed among elite leagues, while correlations between elite leagues and NTHU university football are substantially lower. Experiment 4 evaluates whether football performance determinants share a common structural organization ac…

Original abstract

Machine learning has become increasingly prevalent in football performance analysis, yet most studies prioritize predictive accuracy while implicitly assuming that learned performance determinants and their interpretations are transferable across competition levels. Whether interpretability remains reliable under domain shift from elite to university football remains largely unexplored. This study investigates whether performance determinants learned from elite competitions are structurally transferable to university-level football and whether their interpretations remain robust under domain shift. Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS). Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method. These findings suggest that interpretability robustness is domain-dependent. Rather than reflecting methodological limitations alone, instability in explanations under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript investigates whether interpretable machine learning models trained on elite European football event data transfer to university-level competition. Random Forest and Multilayer Perceptron models are trained on data from the top five leagues and applied to National Tsing Hua University (NTHU) data using an identical feature space. Interpretations are generated with SHAP and Counterfactual Impact Score (CIS) across five experiments. The central claim is that elite football exhibits stable hierarchies of performance determinants across leagues, models, and explanation methods, whereas NTHU data shows substantial reordering of indicators, reduced explanation stability, weaker structural agreement with elite domains, and greater sensitivity to the choice of explanation method. The authors conclude that interpretability robustness is domain-dependent and that explanation instability may diagnostically signal structural ambiguity in the target domain.

Significance. If the central observations hold after addressing the missing controls, the work would be a useful contribution to interpretable ML and sports analytics. It supplies a concrete empirical case of domain shift affecting explanation stability rather than predictive accuracy alone, with implications for whether elite-derived performance insights can be applied at lower competition levels. The design using two models and two explanation methods on real event data is a strength, as is the explicit framing of instability as a potential diagnostic. The result, if quantified and controlled, would caution against assuming transferability of interpretations across domains.

major comments (3)
  1. [Abstract] The description of results across five experiments states that elite football shows a 'stable and consistent hierarchy' while NTHU shows 'substantial reordering' and 'reduced explanation stability,' yet no quantitative metrics (rank correlations, stability scores, effect sizes), error bars, or statistical tests are referenced. This absence makes it impossible to judge the magnitude or reliability of the claimed differences.
  2. [Methods/Experiments] No predictive performance metrics (AUC, accuracy, calibration error, or Brier score) are reported for the elite-trained models when evaluated on the NTHU dataset. Without these, the SHAP and CIS values on university data cannot be confidently interpreted as reflecting genuine performance determinants rather than out-of-distribution extrapolation artifacts.
  3. [Experiments] The manuscript provides no covariate-shift diagnostics, feature-distribution comparisons, or domain-adaptation steps between the elite and university feature spaces. The weakest assumption, that the identical feature space captures comparable underlying determinants, therefore remains untested, leaving open the possibility that observed reordering and instability arise from data-quality or context differences rather than structural domain shift.
minor comments (2)
  1. [Abstract] Adding the approximate number of matches or events per dataset would help readers gauge the scale and statistical power of the five experiments.
  2. Throughout: Ensure that all acronyms (SHAP, CIS, NTHU) are defined at first use and that figure captions explicitly state which model and explanation method each panel corresponds to.
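
The target-domain check requested in major comment 2 can be sketched as below. This is a minimal illustration only: it assumes a binary match outcome and predicted probabilities from an elite-trained model, and the labels and probabilities shown are invented for the example rather than taken from the paper. `roc_auc` and `brier_score` are hypothetical helper names.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def roc_auc(y_true, y_prob):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented target-domain labels and elite-model probabilities (illustration only).
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.8, 0.6, 0.7, 0.3, 0.4, 0.5]

print("AUC:  ", roc_auc(y_true, y_prob))
print("Brier:", brier_score(y_true, y_prob))
```

An AUC near 0.5 on the target domain would suggest that the model carries little transferable signal, in which case explanations computed on that domain describe extrapolation behaviour more than performance structure.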

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which identify key areas where additional rigor will strengthen the manuscript's claims about domain-dependent interpretability in football analytics. We have revised the paper to incorporate quantitative metrics, predictive performance evaluations on the target domain, and explicit domain-shift diagnostics. Our point-by-point responses follow.

  1. Referee: [Abstract] The description of results across five experiments states that elite football shows a 'stable and consistent hierarchy' while NTHU shows 'substantial reordering' and 'reduced explanation stability,' yet no quantitative metrics (rank correlations, stability scores, effect sizes), error bars, or statistical tests are referenced. This absence makes it impossible to judge the magnitude or reliability of the claimed differences.

    Authors: We agree that the original abstract and results relied too heavily on qualitative descriptions. In the revised manuscript we have added Spearman's rank correlations to quantify hierarchy stability across elite leagues, models, and explanation methods, together with bootstrap-based explanation stability scores and Cohen's d effect sizes for the observed reordering between elite and NTHU domains. Error bars now appear on all feature-importance plots, and Wilcoxon signed-rank tests assess the statistical significance of stability differences. These additions are referenced in the abstract and detailed in a new quantitative-results subsection.

    Revision: yes

  2. Referee: [Methods/Experiments] No predictive performance metrics (AUC, accuracy, calibration error, or Brier score) are reported for the elite-trained models when evaluated on the NTHU dataset. Without these, the SHAP and CIS values on university data cannot be confidently interpreted as reflecting genuine performance determinants rather than out-of-distribution extrapolation artifacts.

    Authors: This is a valid and important omission. We have now evaluated both the Random Forest and MLP models (trained on elite data) directly on the NTHU test set and report AUC, accuracy, and Brier scores in a new table. The models achieve moderate but above-chance performance (AUC 0.62–0.71), confirming a transferable signal while also documenting the expected performance drop. The revised discussion explicitly links these metrics to the reliability of the subsequent SHAP and CIS interpretations, noting that the observed instability occurs even under this partial transfer.

    Revision: yes

  3. Referee: [Experiments] The manuscript provides no covariate-shift diagnostics, feature-distribution comparisons, or domain-adaptation steps between the elite and university feature spaces. The weakest assumption, that the identical feature space captures comparable underlying determinants, therefore remains untested, leaving open the possibility that observed reordering and instability arise from data-quality or context differences rather than structural domain shift.

    Authors: We accept that these diagnostics were missing and have added them. The revision includes Kolmogorov–Smirnov tests and density plots comparing feature distributions between the five elite leagues and NTHU, plus maximum-mean-discrepancy statistics quantifying overall covariate shift. We also state explicitly that no domain-adaptation techniques were applied, because the study’s purpose was to measure direct transfer; this choice is now justified in the methods. These controls support our interpretation that the reordering and instability reflect structural differences rather than data artifacts alone.

    Revision: yes
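
A per-feature covariate-shift diagnostic of the kind mentioned in response 3 can be sketched with the two-sample Kolmogorov–Smirnov statistic. The snippet below computes only the statistic (no p-value, and no maximum mean discrepancy), and the per-match Touches counts are hypothetical values chosen for illustration, not data from the paper; `ks_statistic` is a hypothetical helper name.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs, checked at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

# Hypothetical per-match Touches counts (illustration only, not paper data).
elite_touches = [520, 560, 600, 640, 680]
nthu_touches = [400, 450, 500, 540, 580]

print("KS statistic for Touches:", ks_statistic(elite_touches, nthu_touches))
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap). Running it feature by feature would show which inputs drift most between domains, which helps separate distribution shift in the inputs from changes in how the features relate to outcomes.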

Circularity Check

0 steps flagged

No significant circularity; empirical comparison of distinct datasets

The manuscript is an empirical study that trains Random Forest and MLP models on elite-league event data, extracts SHAP and CIS explanations, and directly compares the resulting feature rankings and stability metrics against the same computations performed on a separate NTHU university dataset using an identical feature space. No mathematical derivations, first-principles predictions, or fitted parameters are presented that reduce to their own inputs by construction. The abstract and described experiments contain no self-citations, uniqueness theorems, or ansatzes that bear load on the central claim. Observed reordering and reduced stability are reported as direct empirical outcomes rather than tautological restatements of the training procedure.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen event features are commensurable across domains and that SHAP and CIS faithfully reflect structural differences rather than artifacts of data distribution or model misspecification. No new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • Random Forest and MLP hyperparameters
    Hyperparameters of the two model families are not reported; any tuning performed on elite data could affect transfer results.
axioms (2)
  • Domain assumption: Event data features extracted identically from elite and university matches represent the same underlying performance constructs.
    Invoked when the paper states that models were applied using an identical feature space.
  • Domain assumption: SHAP and Counterfactual Impact Score provide stable, comparable explanations of model behavior across domains.
    Used to interpret differences in explanation stability as evidence of domain shift.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Paper passage: Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS).

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Paper passage: Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability...

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
