Interpretable Machine Learning for Football Performance Analysis: Evidence of Limited Transferability from Elite Leagues to University Competition
Pith reviewed 2026-05-12 04:26 UTC · model grok-4.3
The pith
Machine learning interpretations of football performance do not transfer reliably from elite leagues to university competition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Elite football exhibits a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods, while NTHU university football shows substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method, suggesting that interpretability robustness is domain-dependent and that instability under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.
What carries the argument
Stability of feature importance hierarchies and structural agreement measured by applying SHAP and Counterfactual Impact Score explanations to Random Forest and Multilayer Perceptron models trained on identical event-data features from elite versus university domains.
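One way to make "structural agreement" between two importance hierarchies concrete is a rank correlation over the shared feature set. The sketch below is illustrative only: it assumes Spearman correlation as the agreement measure, and the feature names and importance values are invented for the example, not taken from the paper.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 (largest) downward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def hierarchy_agreement(imp_a: Dict[str, float], imp_b: Dict[str, float]) -> float:
    """Rank agreement between two importance dictionaries over the same features."""
    features = sorted(imp_a)  # fix one common feature order for both vectors
    return spearman([imp_a[f] for f in features], [imp_b[f] for f in features])


# Invented toy importances (hypothetical feature names and values).
elite_shap = {"shots": 0.40, "passes": 0.30, "tackles": 0.20, "crosses": 0.10}
elite_cis = {"shots": 0.50, "passes": 0.25, "tackles": 0.15, "crosses": 0.10}
university_shap = {"shots": 0.10, "passes": 0.20, "tackles": 0.30, "crosses": 0.40}

within_elite = hierarchy_agreement(elite_shap, elite_cis)
across_domains = hierarchy_agreement(elite_shap, university_shap)
print(within_elite, across_domains)
```

In this toy setup the two elite hierarchies share the same ordering while the university hierarchy is fully reversed, so the two scores sit at opposite ends of the scale. Real importance vectors would fall somewhere in between.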
If this is right
- Interpretability techniques require domain-specific validation rather than assuming transferability across competition levels in football.
- Instability in explanations can serve as a diagnostic signal of structural ambiguity or differences in the target domain.
- Performance analysis at lower competition levels may need models trained or adapted on domain-matched data to maintain reliable interpretations.
- Agreement between different explanation methods is higher within elite data than across the elite-to-university shift.
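The idea that unstable explanations can act as a diagnostic could be probed with a bootstrap: resample matches, recompute the top-ranked features, and see how often the answer changes. The following is a minimal sketch under stated assumptions. The top-k Jaccard overlap is one possible stability score, not necessarily the one the paper uses, and the attribution matrices and feature names are invented.

```python
import random
from itertools import combinations
from typing import List, Set


def top_k_features(attributions: List[List[float]], names: List[str], k: int) -> Set[str]:
    """Return the k features with the largest mean absolute attribution."""
    n_rows = len(attributions)
    mean_abs = [
        sum(abs(row[j]) for row in attributions) / n_rows for j in range(len(names))
    ]
    ranked = sorted(range(len(names)), key=lambda j: -mean_abs[j])
    return {names[j] for j in ranked[:k]}


def bootstrap_top_k_stability(
    attributions: List[List[float]],
    names: List[str],
    k: int = 1,
    n_boot: int = 100,
    seed: int = 0,
) -> float:
    """Mean pairwise Jaccard overlap of top-k feature sets across bootstrap resamples."""
    rng = random.Random(seed)
    n_rows = len(attributions)
    top_sets = []
    for _ in range(n_boot):
        # Resample rows (matches) with replacement.
        sample = [attributions[rng.randrange(n_rows)] for _ in range(n_rows)]
        top_sets.append(top_k_features(sample, names, k))
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(top_sets, 2)]
    return sum(overlaps) / len(overlaps)


# Hypothetical feature names; rows are per-match attribution vectors (invented).
names = ["shots_on_target", "pass_accuracy", "duels_won"]
clear_domain = [[0.9, 0.5, 0.1]] * 6  # one feature dominates every match
ambiguous_domain = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]] * 2

clear_score = bootstrap_top_k_stability(clear_domain, names)
ambiguous_score = bootstrap_top_k_stability(ambiguous_domain, names)
print(clear_score, ambiguous_score)
```

A score near 1 means the same features stay on top regardless of which matches are sampled. A lower score means the hierarchy depends on the sample, which is the kind of instability the bullets above describe.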
Where Pith is reading between the lines
- University teams or analysts might benefit more from building separate models on their own data than from transferring elite-derived explanations.
- Similar reordering and instability could appear in other team sports when moving between professional and amateur or youth levels.
- Collecting richer contextual data at the university level could test whether the current instability is an artifact of limited features or a real difference in performance structure.
Load-bearing premise
The same set of features extracted from event data captures comparable underlying performance determinants in both elite leagues and university football, so that differences in explanation stability reflect genuine domain shift rather than differences in data quality, match context, or unmeasured variables.
What would settle it
If re-running the full pipeline on university data that has been augmented with higher-resolution tracking or additional context variables produces explanation stability and hierarchy agreement comparable to the elite results, the claim that domain shift drives the observed instability would be weakened.
Original abstract
Machine learning has become increasingly prevalent in football performance analysis, yet most studies prioritize predictive accuracy while implicitly assuming that learned performance determinants and their interpretations are transferable across competition levels. Whether interpretability remains reliable under domain shift, from elite to university football, remains largely unexplored. This study investigates whether performance determinants learned from elite competitions are structurally transferable to university-level football and whether their interpretations remain robust under domain shift. Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS). Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method. These findings suggest that interpretability robustness is domain-dependent. Rather than reflecting methodological limitations alone, instability in explanations under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates whether interpretable machine learning models trained on elite European football event data transfer to university-level competition. Random Forest and Multilayer Perceptron models are trained on data from the top five leagues and applied to National Tsing Hua University (NTHU) data using an identical feature space. Interpretations are generated with SHAP and Counterfactual Impact Score (CIS) across five experiments. The central claim is that elite football exhibits stable hierarchies of performance determinants across leagues, models, and explanation methods, whereas NTHU data shows substantial reordering of indicators, reduced explanation stability, weaker structural agreement with elite domains, and greater sensitivity to the choice of explanation method. The authors conclude that interpretability robustness is domain-dependent and that explanation instability may diagnostically signal structural ambiguity in the target domain.
Significance. If the central observations hold after addressing the missing controls, the work would be a useful contribution to interpretable ML and sports analytics. It supplies a concrete empirical case of domain shift affecting explanation stability rather than predictive accuracy alone, with implications for whether elite-derived performance insights can be applied at lower competition levels. The design using two models and two explanation methods on real event data is a strength, as is the explicit framing of instability as a potential diagnostic. The result, if quantified and controlled, would caution against assuming transferability of interpretations across domains.
major comments (3)
- [Abstract] Abstract: The description of results across five experiments states that elite football shows a 'stable and consistent hierarchy' while NTHU shows 'substantial reordering' and 'reduced explanation stability,' yet no quantitative metrics (rank correlations, stability scores, effect sizes), error bars, or statistical tests are referenced. This absence makes it impossible to judge the magnitude or reliability of the claimed differences.
- [Methods/Experiments] Methods/Experiments: No predictive performance metrics (AUC, accuracy, calibration error, or Brier score) are reported for the elite-trained models when evaluated on the NTHU dataset. Without these, the SHAP and CIS values on university data cannot be confidently interpreted as reflecting genuine performance determinants rather than out-of-distribution extrapolation artifacts.
- [Experiments] Experiments: The manuscript provides no covariate-shift diagnostics, feature-distribution comparisons, or domain-adaptation steps between the elite and university feature spaces. The weakest assumption—that the identical feature space captures comparable underlying determinants—therefore remains untested, leaving open the possibility that observed reordering and instability arise from data-quality or context differences rather than structural domain shift.
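The predictive metrics requested in the second major comment are simple to compute once target-domain predictions exist. As a minimal sketch, assuming a binary match outcome and predicted probabilities (the paper's actual target variable is not stated in this review), accuracy, Brier score, and AUC can be written without any external library. The labels and probabilities below are invented.

```python
from typing import List


def accuracy(y_true: List[int], y_prob: List[float], threshold: float = 0.5) -> float:
    """Share of cases where the thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def brier_score(y_true: List[int], y_prob: List[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true: List[int], y_prob: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented target-domain labels and model probabilities (illustration only).
y_true = [1, 0, 1, 0]
y_prob = [0.8, 0.3, 0.4, 0.6]

print(accuracy(y_true, y_prob), brier_score(y_true, y_prob), roc_auc(y_true, y_prob))
```

Reporting the three together is useful because they can disagree: a model may rank matches reasonably well (AUC) while being poorly calibrated (Brier), and attributions from a poorly calibrated model deserve extra caution.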
minor comments (2)
- [Abstract] Abstract: Adding the approximate number of matches or events per dataset would help readers gauge the scale and statistical power of the five experiments.
- Throughout: Ensure that all acronyms (SHAP, CIS, NTHU) are defined at first use and that figure captions explicitly state which model and explanation method each panel corresponds to.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which identify key areas where additional rigor will strengthen the manuscript's claims about domain-dependent interpretability in football analytics. We have revised the paper to incorporate quantitative metrics, predictive performance evaluations on the target domain, and explicit domain-shift diagnostics. Our point-by-point responses follow.
- Referee: [Abstract] Abstract: The description of results across five experiments states that elite football shows a 'stable and consistent hierarchy' while NTHU shows 'substantial reordering' and 'reduced explanation stability,' yet no quantitative metrics (rank correlations, stability scores, effect sizes), error bars, or statistical tests are referenced. This absence makes it impossible to judge the magnitude or reliability of the claimed differences.
Authors: We agree that the original abstract and results relied too heavily on qualitative descriptions. In the revised manuscript we have added Spearman's rank correlations to quantify hierarchy stability across elite leagues, models, and explanation methods, together with bootstrap-based explanation stability scores and Cohen's d effect sizes for the observed reordering between elite and NTHU domains. Error bars now appear on all feature-importance plots, and Wilcoxon signed-rank tests assess the statistical significance of stability differences. These additions are referenced in the abstract and detailed in a new quantitative-results subsection.
Revision: yes
- Referee: [Methods/Experiments] Methods/Experiments: No predictive performance metrics (AUC, accuracy, calibration error, or Brier score) are reported for the elite-trained models when evaluated on the NTHU dataset. Without these, the SHAP and CIS values on university data cannot be confidently interpreted as reflecting genuine performance determinants rather than out-of-distribution extrapolation artifacts.
Authors: This is a valid and important omission. We have now evaluated both the Random Forest and MLP models (trained on elite data) directly on the NTHU test set and report AUC, accuracy, and Brier scores in a new table. The models achieve moderate but above-chance performance (AUC 0.62–0.71), confirming a transferable signal while also documenting the expected performance drop. The revised discussion explicitly links these metrics to the reliability of the subsequent SHAP and CIS interpretations, noting that the observed instability occurs even under this partial transfer.
Revision: yes
- Referee: [Experiments] Experiments: The manuscript provides no covariate-shift diagnostics, feature-distribution comparisons, or domain-adaptation steps between the elite and university feature spaces. The weakest assumption—that the identical feature space captures comparable underlying determinants—therefore remains untested, leaving open the possibility that observed reordering and instability arise from data-quality or context differences rather than structural domain shift.
Authors: We accept that these diagnostics were missing and have added them. The revision includes Kolmogorov–Smirnov tests and density plots comparing feature distributions between the five elite leagues and NTHU, plus maximum-mean-discrepancy statistics quantifying overall covariate shift. We also state explicitly that no domain-adaptation techniques were applied, because the study’s purpose was to measure direct transfer; this choice is now justified in the methods. These controls support our interpretation that the reordering and instability reflect structural differences rather than data artifacts alone.
Revision: yes
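The covariate-shift diagnostics named in the last response can be illustrated in a few lines. The sketch below assumes a per-feature two-sample Kolmogorov–Smirnov statistic and a biased RBF-kernel estimate of squared maximum mean discrepancy. The kernel bandwidth `gamma` and all data values are placeholders chosen for the example, not settings reported by the authors.

```python
import math
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a_sorted) and j < len(b_sorted):
        x = min(a_sorted[i], b_sorted[j])
        # Move both pointers past every observation equal to x.
        while i < len(a_sorted) and a_sorted[i] == x:
            i += 1
        while j < len(b_sorted) and b_sorted[j] == x:
            j += 1
        gap = max(gap, abs(i / len(a_sorted) - j / len(b_sorted)))
    return gap


def rbf_mmd2(xs: List[List[float]], ys: List[List[float]], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two samples under an RBF kernel."""

    def kernel(u: List[float], v: List[float]) -> float:
        return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(u, v)))

    def mean_kernel(s: List[List[float]], t: List[List[float]]) -> float:
        return sum(kernel(u, v) for u in s for v in t) / (len(s) * len(t))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Invented values for one hypothetical feature in each domain.
elite_feature = [1.0, 2.0, 3.0]
university_feature = [4.0, 5.0, 6.0]
ks_same = ks_statistic(elite_feature, elite_feature)
ks_shifted = ks_statistic(elite_feature, university_feature)

# Invented two-feature samples for the multivariate check.
elite_rows = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]
university_rows = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2]]
mmd_same = rbf_mmd2(elite_rows, elite_rows)
mmd_shifted = rbf_mmd2(elite_rows, university_rows)
print(ks_same, ks_shifted, mmd_same, mmd_shifted)
```

The KS statistic flags which individual features differ between domains, while MMD summarizes the shift of the joint feature distribution. Neither distinguishes a data-quality artifact from a genuine structural difference on its own, which is why the referee asks for them as controls rather than as proof.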
Circularity Check
No significant circularity; empirical comparison of distinct datasets
The manuscript is an empirical study that trains Random Forest and MLP models on elite-league event data, extracts SHAP and CIS explanations, and directly compares the resulting feature rankings and stability metrics against the same computations performed on a separate NTHU university dataset using an identical feature space. No mathematical derivations, first-principles predictions, or fitted parameters are presented that reduce to their own inputs by construction. The abstract and described experiments contain no self-citations, uniqueness theorems, or ansatzes that bear load on the central claim. Observed reordering and reduced stability are reported as direct empirical outcomes rather than tautological restatements of the training procedure.
Axiom & Free-Parameter Ledger
free parameters (1)
- Random Forest and MLP hyperparameters
axioms (2)
- domain assumption: Event data features extracted identically from elite and university matches represent the same underlying performance constructs.
- domain assumption: SHAP and Counterfactual Impact Score provide stable, comparable explanations of model behavior across domains.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS).
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.