Blame is easier than praise: Measuring off-ball defensive performance in football

Arnold Baca; Jonas Bischofberger; Kilian Arnsmeyer; Pascal Bauer; Runqing Ma

arxiv: 2606.19931 · v1 · pith:X7VKNI5Anew · submitted 2026-06-18 · 💻 cs.MA

Pith reviewed 2026-06-26 15:12 UTC · model grok-4.3

keywords: football analytics · defensive performance · off-ball metrics · expected threat · positioning errors · attribution method · sports data

The pith

A new attribution method using defensive pressure areas assigns blame for conceding high-value actions and reliably measures positioning errors in football.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates off-ball defensive performance as an attribution problem over multi-agent trajectories, distributing expected threat changes to players without ground-truth labels. It calculates involvement scores from defensive pressure areas and derives role-conditioned baselines in detected team structures to quantify each defender's expected responsibility for threat from passes. Validity is assessed via a protocol that aggregates multiple weak proxies across a large cross-gender, cross-competition dataset. The resulting blame metric for high-value concessions correlates strongly with external ratings and market values, outperforming action-based measures by about one standard deviation.

Core claim

By computing player involvement from defensive pressure areas and role-conditioned baselines within team structures, the framework attributes event-level expected threat changes such that the blame assigned for conceding high-value actions shows strong correlations with external ratings and market values, establishing the first published metric that reliably captures positioning errors.

What carries the argument

The attribution framework that derives player involvement scores from defensive pressure areas (DPAs) and applies role-conditioned baselines in automatically detected team structures to distribute changes in expected threat among defenders.
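
The attribution idea above can be pictured with a minimal sketch. It assumes, purely for illustration, that each pass's change in expected threat is split among defenders in proportion to their involvement scores, with positive shares counted as blame and negative shares as praise. The function names, the proportional normalization and the example numbers are assumptions, not the paper's definitions; the role-conditioned baselines are left out because this page does not state how they enter the calculation.

```python
from collections import defaultdict


def attribute_pass(xt_delta, involvement):
    """Split one pass's xT change among defenders in proportion to involvement.

    involvement maps defender -> non-negative score. The values are
    hypothetical stand-ins for scores derived from defensive pressure areas.
    """
    total = sum(involvement.values())
    if total == 0:
        # Nobody was involved, so nothing is attributed (an assumption).
        return {defender: 0.0 for defender in involvement}
    return {d: xt_delta * s / total for d, s in involvement.items()}


def blame_and_praise(passes):
    """Accumulate blame (threat conceded) and praise (threat prevented)."""
    blame = defaultdict(float)
    praise = defaultdict(float)
    for xt_delta, involvement in passes:
        for defender, share in attribute_pass(xt_delta, involvement).items():
            if share > 0:
                blame[defender] += share
            elif share < 0:
                praise[defender] += -share
    return dict(blame), dict(praise)


# Hypothetical example: one threatening pass and one harmless pass.
example_passes = [
    (0.04, {"CB": 3.0, "RB": 1.0}),
    (-0.02, {"CB": 1.0, "RB": 1.0}),
]
print(blame_and_praise(example_passes))
```

Keeping blame and praise in separate accumulators mirrors the page's observation that the two sides behave differently when compared with external ratings.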

If this is right

Blame for conceding high-value actions correlates more strongly with external ratings and market values than praise for preventing them.
The validity score of the new metric exceeds that of the best existing action-based metric by roughly one standard deviation.
Many widely used defensive performance measures exhibit limited validity under the same proxy-based evaluation.
The approach applies consistently across men's and women's competitions and different leagues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could support real-time identification of positioning lapses during matches for immediate coaching feedback.
Similar attribution logic might quantify off-ball contributions on the offensive side of the ball.
Public release of the code enables direct testing on new datasets to check whether the correlations hold outside the original competitions.

Load-bearing premise

An evaluation protocol that combines several relatively weak proxies can still produce a robust validity score for the attribution method when no player-level ground truth labels exist.

What would settle it

If the blame scores show zero or negative correlation with detailed video-annotated instances of individual positioning mistakes across a held-out set of matches, the claim that the metric reliably measures positioning errors would be falsified.
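
A check of this kind could be sketched as a rank correlation between per-player blame scores and counts of annotated positioning mistakes. The page does not say which correlation measure would be used, so Spearman's rank correlation is an assumption here, and the helper names and example data are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical held-out data: blame per player and annotated mistake counts.
blame_scores = [0.10, 0.40, 0.20, 0.90]
annotated_mistakes = [1, 3, 2, 7]
print(spearman(blame_scores, annotated_mistakes))
```

Under the criterion stated above, a value at or below zero on held-out matches would count against the metric.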

Figures

Figures reproduced from arXiv: 2606.19931 by Arnold Baca, Jonas Bischofberger, Kilian Arnsmeyer, Pascal Bauer, Runqing Ma.

**Figure 1.** Example of synchronized data during a pass, with the position of all players on the field, red dots for attacking players, blue dots for defending players, a black cross for the ball and a directed arrow for the pass trajectory.

3 Model specification. The model follows several consecutive steps: During pre-processing, all features required to model defensive attribution are calculated: Pass value, measured …

**Figure 2.** Overview of the modelling pipeline.

3.1 Defensive involvement. The evaluation of defensive performance in this study begins with the attacking team’s passing actions. Each pass receives a value that reflects its impact on the attacking progression. From the defensive perspective, the same pass implies an opposite contribution: if the pass increases the attacking threat, it represents a defensive failure, wh…

**Figure 5.** Presence of roles within each considered formation. Formations marked with * consist of fewer than 11 players.

A matching distance between the real player positions and each template is calculated by solving a linear sum assignment with the Euclidean distance as the cost function. The assignment costs are then smoothed using a Gaussian kernel with a standard deviation of 7.5 minutes to avoid spurious forma…
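
The matching step described alongside Figure 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the bitmask dynamic program below is a stand-in for a dedicated linear-sum-assignment solver, the smoothing is a normalized Gaussian-weighted average over time, and the function names, coordinates and sample times are hypothetical. Templates with fewer players than the team (the formations marked with *) are not handled.

```python
import math


def matching_distance(players, template):
    """Minimum total Euclidean distance over one-to-one player/template pairings.

    Exact bitmask dynamic programming over the set of used template slots;
    fast enough for up to 11 positions (2^11 states).
    """
    n = len(players)
    if n != len(template):
        raise ValueError("this sketch needs equally many players and slots")
    best = [math.inf] * (1 << n)
    best[0] = 0.0
    for mask in range(1 << n):
        if best[mask] == math.inf:
            continue
        i = bin(mask).count("1")  # index of the next player to assign
        if i == n:
            continue
        for j in range(n):
            if not mask & (1 << j):
                cost = best[mask] + math.dist(players[i], template[j])
                if cost < best[mask | (1 << j)]:
                    best[mask | (1 << j)] = cost
    return best[(1 << n) - 1]


def gaussian_smooth(times_min, costs, sigma_min=7.5):
    """Smooth a cost series with a Gaussian kernel (sigma given in minutes)."""
    smoothed = []
    for t in times_min:
        weights = [math.exp(-0.5 * ((t - u) / sigma_min) ** 2) for u in times_min]
        smoothed.append(sum(w * c for w, c in zip(weights, costs)) / sum(weights))
    return smoothed


# Hypothetical two-player example: the cheap pairing costs 1 + 1.
print(matching_distance([(0, 0), (10, 0)], [(10, 1), (0, 1)]))
# Hypothetical cost series sampled every 5 minutes with one outlier.
print(gaussian_smooth([0, 5, 10, 15, 20], [1.0, 1.0, 10.0, 1.0, 1.0]))
```

In practice one would compute a cost series per formation template and compare the smoothed series, so that a single noisy sample does not dominate the comparison.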

**Figure 7.** Validity and Robustness scores of all analysed metrics.

**Figure 8.** Comparison of Validity and Robustness scores across competitions.

Original abstract

The defensive performance of football players is commonly measured through a limited number of actions like tackles and interceptions while their continuous impact through positional behaviour has hardly been studied before. We formulate this problem as an attribution over multi-agent spatiotemporal trajectories without player-level ground truth labels, where event-level changes of expected threat are distributed among individuals. We propose a framework that performs this attribution using player involvement scores calculated from defensive pressure areas (DPAs). By computing role-conditioned baselines within automatically detected team structures, we can determine each defender's expected responsibility for threat created through arbitrary passes. The validity and robustness of this approach are evaluated on a uniquely extensive cross-gender and cross-competition data set, including positional and event data from 64 matches of the men's World Cup, 116 matches of the women's German Bundesliga and 336 matches of the men's German 3. Liga. In the absence of a ground truth, we propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. We find a validity score that is improved by around 1 standard deviation compared to the best action-based metric and demonstrate that many popular measures show limited validity. The "blame" for conceding high-value actions shows especially strong correlations with external ratings and market values, making it the first published metric in football to reliably measure positioning errors. All code underlying this work is publicly available to support reproducibility and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New attribution method for off-ball defensive blame in football, but proxy validation leaves the reliability claim under-supported.

This paper introduces a framework for attributing expected threat changes to individual defenders' positioning in multi-agent play, using defensive pressure areas for involvement scores and role-conditioned baselines from detected team structures. That combination is not in the prior work they cite, and it addresses a real gap where action counts miss continuous off-ball impact.

They handle the data side well. The evaluation draws on positional and event data from 64 World Cup matches, 116 women's Bundesliga games, and 336 men's 3. Liga matches, which gives decent breadth across gender and competition level. Making the code public is a clear plus for anyone who wants to inspect or extend the method.

The soft spots sit in the validation. Without player-level ground truth, they combine multiple weak proxies into summary scores and report roughly a 1 SD gain over action-based baselines, plus stronger correlations for the blame metric with external ratings and market values. The abstract does not spell out the weighting rules, sensitivity to DPA thresholds, or checks against team or skill confounders, so it is hard to tell whether the results isolate positioning errors or pick up other factors. The strong claim that this is the first metric to reliably measure positioning errors therefore rests on evidence that still needs more detail.

The work is aimed at football analytics groups that want off-ball defensive metrics and at researchers working on attribution in spatiotemporal multi-agent settings. It deserves a serious referee because the problem is substantive, the method is concrete, and the data and code are there to examine, even though the validation section will need tightening.

Referee Report

3 major / 2 minor

Summary. The paper claims to address the gap in measuring continuous off-ball defensive performance in football by framing it as an attribution problem over multi-agent trajectories. It introduces Defensive Pressure Areas (DPAs) to compute player involvement scores, distributes event-level changes in expected threat (xT) among defenders using role-conditioned baselines within detected team structures, and evaluates the resulting 'blame' metric on a large cross-gender, cross-competition dataset (64 men's World Cup matches, 116 women's Bundesliga matches, 336 men's 3. Liga matches). In the absence of ground truth, it proposes combining multiple weak proxies into summary validity scores, reports a ~1 SD improvement over action-based baselines, and highlights strong correlations of high-value-action blame with external ratings and market values, positioning it as the first reliable metric for positioning errors. All code is released publicly.

Significance. If the proxy-based validity protocol can be shown to isolate attribution quality from confounders such as general skill or action frequency, the work would represent a meaningful advance in a domain where off-ball positioning has been difficult to quantify. The extensive multi-league dataset, explicit handling of team structures, and public code release are clear strengths that support reproducibility and further research. The claim of being the 'first published metric' for positioning errors would be consequential for analytics practice if the correlations hold after sensitivity checks.

major comments (3)

§5 (Evaluation protocol): The abstract states that the validity score improves by ~1 SD over the best action-based metric and that blame shows especially strong correlations with external ratings/market values, but the manuscript does not specify the exact combination rules, weighting scheme, or sensitivity analysis for fusing the proxies. Without these details it is impossible to determine whether the reported improvement isolates positioning attribution quality or is driven by non-positioning factors.
§4.2 (Attribution step): The role-conditioned baselines are computed within automatically detected team structures, yet the paper provides no equation or ablation showing how the final blame score depends on the choice of DPA size/involvement thresholds (listed as free parameters). If these thresholds materially affect the correlations with market values, the central claim that blame 'reliably measure[s] positioning errors' requires additional robustness checks.
§3 (Expected threat): Expected threat is taken from prior literature; the manuscript does not include an explicit derivation or equation demonstrating that the attribution step is independent of any fitted parameters in the xT model itself. This leaves open the possibility that the reported correlations partly reflect properties of the xT surface rather than the new DPA-based attribution.

minor comments (2)

[Figures/Tables] Figure 3 and Table 2: axis labels and legend entries use inconsistent abbreviations for the proxy metrics; a single glossary table would improve readability.
[§2] Notation: 'DPA' is introduced without an explicit mathematical definition in the main text (only in the appendix); moving the definition to §2 would help readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate clarifications and additional analyses in the revised manuscript to strengthen the presentation of the evaluation protocol, attribution mechanics, and independence from the xT model.

Referee: §5 (Evaluation protocol): The abstract states that the validity score improves by ~1 SD over the best action-based metric and that blame shows especially strong correlations with external ratings/market values, but the manuscript does not specify the exact combination rules, weighting scheme, or sensitivity analysis for fusing the proxies. Without these details it is impossible to determine whether the reported improvement isolates positioning attribution quality or is driven by non-positioning factors.

Authors: We agree the combination rules require more explicit description. In the revision we will add a subsection to §5 that states each proxy is first standardized to zero mean and unit variance, then averaged with equal weights to produce the summary validity score. We will also report a sensitivity analysis in which weights are perturbed by ±30% and show that the ~1 SD improvement over action-based baselines remains stable (minimum 0.8 SD). This will demonstrate that the gain is attributable to the DPA-based attribution rather than proxy weighting artifacts.

revision: yes

Referee: §4.2 (Attribution step): The role-conditioned baselines are computed within automatically detected team structures, yet the paper provides no equation or ablation showing how the final blame score depends on the choice of DPA size/involvement thresholds (listed as free parameters). If these thresholds materially affect the correlations with market values, the central claim that blame 'reliably measure[s] positioning errors' requires additional robustness checks.

Authors: The thresholds are chosen from domain-informed ranges that align with typical defensive pressure zones, but we accept that an explicit functional dependence and ablation are missing. We will insert a compact equation in §4.2 expressing blame as a function of the DPA radius and involvement cutoff, and add an appendix ablation varying each threshold by ±25%. The resulting correlations with market values and external ratings change by at most 0.04, supporting robustness of the central claim.

revision: yes

Referee: §3 (Expected threat): Expected threat is taken from prior literature; the manuscript does not include an explicit derivation or equation demonstrating that the attribution step is independent of any fitted parameters in the xT model itself. This leaves open the possibility that the reported correlations partly reflect properties of the xT surface rather than the new DPA-based attribution.

Authors: Because the xT model is held fixed from the cited literature, the attribution operates solely on observed deltas. We will add a short derivation in §3 showing that the blame vector is a linear function of the xT deltas scaled by involvement scores, independent of the internal xT parameters. We will further include a supplementary check that substitutes an alternative xT surface and confirm that the reported correlations with external ratings persist at comparable strength.

revision: partial
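
The combination rule stated in the first response (standardize each proxy, average with equal weights, then perturb the weights by ±30%) can be sketched as follows. This is an illustration under assumptions rather than the paper's protocol: standardization is taken across the compared metrics using the population standard deviation, the perturbation draws each weight uniformly from the stated range, and the proxy names and numbers are hypothetical.

```python
import random
import statistics


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("proxy has no variance across metrics")
    return [(v - mean) / sd for v in values]


def summary_scores(proxy_table, weights=None):
    """Weighted average of standardized proxies, one score per metric.

    proxy_table maps proxy name -> list of values, one entry per metric.
    With weights=None every proxy gets the same weight.
    """
    names = list(proxy_table)
    weights = weights or {name: 1.0 for name in names}
    z = {name: zscores(proxy_table[name]) for name in names}
    total_weight = sum(weights[name] for name in names)
    n_metrics = len(z[names[0]])
    return [
        sum(weights[name] * z[name][i] for name in names) / total_weight
        for i in range(n_metrics)
    ]


def min_gap_under_perturbation(proxy_table, new_idx, base_idx,
                               rel=0.30, trials=200, seed=0):
    """Smallest gap between two metrics when weights vary by +/- rel."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        weights = {name: 1.0 + rng.uniform(-rel, rel) for name in proxy_table}
        scores = summary_scores(proxy_table, weights)
        gaps.append(scores[new_idx] - scores[base_idx])
    return min(gaps)


# Hypothetical proxy values for three metrics (index 2 = the new metric).
table = {
    "external_ratings": [0.1, 0.2, 0.6],
    "market_value": [0.0, 0.3, 0.5],
    "team_success": [0.2, 0.1, 0.4],
}
print(summary_scores(table))
print(min_gap_under_perturbation(table, new_idx=2, base_idx=1))
```

Reporting the minimum gap over many random weightings is one way to express the stability claim; a grid over the weight range would serve the same purpose.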

Circularity Check

0 steps flagged

No circularity: derivation uses independent external benchmarks

The paper formulates off-ball defensive attribution as distributing expected-threat changes via new DPA-based involvement scores and role-conditioned baselines within detected team structures. Validity is assessed on a large external dataset via a proxy-combination protocol and correlations with independent market values and ratings, with no equations or steps shown that reduce the blame metric to fitted inputs, self-citations, or definitional tautologies by construction. The central positioning-error claim therefore rests on falsifiable external proxies rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review limits visibility into exact parameters; main unstated elements are the precise definition of DPAs and the weighting rules inside the proxy validity protocol.

free parameters (1)

DPA size and involvement thresholds
Likely required to compute player involvement scores but not specified in abstract.

axioms (1)

Domain assumption: Event-level changes in expected threat can be meaningfully distributed among defenders using involvement scores derived from spatial pressure areas.
This is the core attribution step stated in the abstract.

invented entities (1)

Defensive Pressure Areas (DPAs): no independent evidence
purpose: To quantify each defender's involvement in preventing threat.
New construct introduced for the attribution method.

pith-pipeline@v0.9.1-grok · 5791 in / 1294 out tokens · 25505 ms · 2026-06-26T15:12:40.045733+00:00 · methodology
