arxiv: 2605.06212 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.CV

Playing the network backward: A Game Theoretic Attribution Framework

Pith reviewed 2026-05-08 13:22 UTC · model grok-4.3

keywords attribution methods · explainable AI · game theory · LRP · vision transformers · neural networks · interpretability · gradients

The pith

Backward attribution methods arise as equilibria in a two-player game on the extended network graph, turning explanation design into strategy selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified framework by modeling backward attribution passes as a two-player game on an extended network graph. In this model, standard techniques such as gradients and the full alpha-beta-LRP family emerge as integrals over game trajectories under particular equilibria, so that attribution maps appear as projections of trajectory distributions. Desired properties of explanations, including localisation focus and robustness, can be expressed as game-theoretic notions like policy regularization or risk aversion and converted directly into new backward rules. One such adapted alpha-beta-LRP rule is shown to outperform earlier transformer-specific methods on all tested localisation metrics for ViT-B/16. If the recasting holds, attribution research gains a common language for comparing methods and deriving targeted variants rather than developing them in isolation.

Core claim

Backward attribution calculations are equivalent to integrals over trajectories in a two-player game on the extended network graph. Gradients arise under one equilibrium while the alpha-beta-LRP family arises under others; the resulting attribution maps are projections of the trajectory distributions. Game concepts such as policy regularization and extended action sets translate into novel adaptations of the backward rules that preserve core properties while adding specified behaviors.
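
For readers unfamiliar with the rule being recovered, the following is a minimal sketch of the standard alpha-beta-LRP redistribution for a single bias-free linear layer, in plain Python. It is not the paper's game construction. The function name `alpha_beta_lrp`, the toy activations and weights, and the `eps` guard are illustrative assumptions. With alpha - beta = 1, and with both positive and negative contributions present at every output, the total relevance is conserved across the layer.

```python
def alpha_beta_lrp(a, W, R_out, alpha=2.0, beta=1.0, eps=1e-12):
    """Redistribute output relevance to the inputs of a bias-free linear layer.

    a      input activations, length n_in
    W      weights, W[i][j] connects input i to output j
    R_out  relevance assigned to each output, length n_out
    Conservation requires alpha - beta == 1 (illustrative defaults: 2 and 1).
    """
    n_in, n_out = len(a), len(R_out)
    R_in = [0.0] * n_in
    for j in range(n_out):
        z = [a[i] * W[i][j] for i in range(n_in)]   # per-input contributions
        z_pos = sum(max(v, 0.0) for v in z)         # total positive contribution
        z_neg = sum(min(v, 0.0) for v in z)         # total negative contribution
        for i in range(n_in):
            pos = max(z[i], 0.0) / z_pos if z_pos > eps else 0.0
            neg = min(z[i], 0.0) / z_neg if z_neg < -eps else 0.0
            R_in[i] += (alpha * pos - beta * neg) * R_out[j]
    return R_in


# Toy layer (hypothetical numbers): 3 inputs, 2 outputs.
a = [1.0, 2.0, 3.0]
W = [[0.5, -1.0],
     [-0.25, 0.5],
     [1.0, -0.5]]
R_out = [1.0, 0.5]

R_in = alpha_beta_lrp(a, W, R_out, alpha=2.0, beta=1.0)
print("input relevance:", R_in)
print("conserved:", abs(sum(R_in) - sum(R_out)) < 1e-9)
```

Setting `alpha=1.0, beta=0.0` gives the positive-only variant, in which every input relevance is non-negative whenever the output relevance is.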

What carries the argument

The two-player game on the extended network graph, in which equilibria and trajectory distributions recover standard attribution rules and generate new ones.

If this is right

  • Gradients and the full alpha-beta-LRP family are recovered as integrals over trajectories under specific equilibria.
  • Attribution maps become projections of trajectory distributions rather than the primary object.
  • Explanation properties such as localisation focus or stable attention routing are specified as game concepts and translated into new backward rules.
  • A selected adaptation of alpha-beta-LRP outperforms prior transformer-specific methods across all considered localisation metrics on ViT-B/16.
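
One way to picture "integrals over trajectories" is the classical path decomposition of a ReLU network's gradient: the derivative of the output with respect to an input equals the sum, over all input-to-output paths, of the product of the weights along the path times the 0/1 ReLU gates it passes. The sketch below checks this on a tiny bias-free network against finite differences. It is an illustration under stated assumptions, not the paper's extended graph or its equilibria; the network, the weights, and the helper names `forward`, `path_sum_gradient`, and `finite_diff_gradient` are all hypothetical.

```python
from itertools import product


def forward(x, weights):
    """Run a bias-free ReLU MLP (linear last layer); return activations and 0/1 gates."""
    acts, gates = [list(x)], []
    for layer, W in enumerate(weights):
        prev = acts[-1]
        pre = [sum(prev[i] * W[i][j] for i in range(len(prev)))
               for j in range(len(W[0]))]
        is_last = layer == len(weights) - 1
        # The last layer is linear, so its gate is always open.
        gate = [1.0 if (is_last or p > 0.0) else 0.0 for p in pre]
        gates.append(gate)
        acts.append([p * g for p, g in zip(pre, gate)])
    return acts, gates


def path_sum_gradient(x, weights):
    """Gradient of the single output as a sum over all input-to-output paths."""
    _, gates = forward(x, weights)
    sizes = [len(x)] + [len(W[0]) for W in weights]
    grad = [0.0] * len(x)
    for path in product(*[range(n) for n in sizes]):
        term = 1.0
        for layer, W in enumerate(weights):
            src, dst = path[layer], path[layer + 1]
            term *= W[src][dst] * gates[layer][dst]   # weight times gate at dst
        grad[path[0]] += term
    return grad


def finite_diff_gradient(x, weights, h=1e-6):
    """Forward-difference estimate of the same gradient, for comparison."""
    base = forward(x, weights)[0][-1][0]
    grad = []
    for k in range(len(x)):
        shifted = list(x)
        shifted[k] += h
        grad.append((forward(shifted, weights)[0][-1][0] - base) / h)
    return grad


# Hypothetical 2-3-2-1 network.
weights = [
    [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]],
    [[1.0, -1.0], [0.5, 2.0], [-1.5, 1.0]],
    [[2.0], [-1.0]],
]
x = [1.0, 3.0]

g_paths = path_sum_gradient(x, weights)
g_fd = finite_diff_gradient(x, weights)
print("path sum:", g_paths)
print("finite differences:", g_fd)
```

Normalising the absolute path terms would give a distribution over paths; the attribution map is then what remains after summing out everything but the input index, which is the sense in which a map is a projection of a richer object.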

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The game view could be used to combine multiple equilibria into hybrid attribution rules that trade off different properties.
  • Testing whether varying risk aversion parameters improves explanation stability under input noise would be a direct next experiment.
  • The framework suggests a route to import solution concepts from game theory to design attributions for architectures beyond vision transformers.
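
A stability experiment of the kind suggested above needs a distance between distributions. The figure captions refer to a Hellinger trajectory distance; the sketch below implements only the standard Hellinger distance between two discrete distributions on a shared support, which is an assumption about how such a comparison could be set up, not the paper's diagnostic. The helper names and the toy numbers are hypothetical.

```python
import math


def normalise(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have positive total mass")
    return [w / total for w in weights]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions; lies in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical path masses for a clean input and a noise-perturbed input.
clean = normalise([4.0, 2.0, 1.0, 1.0])
noisy = normalise([3.5, 2.5, 1.0, 1.0])

print("distance clean vs noisy:", hellinger(clean, noisy))
print("distance clean vs clean:", hellinger(clean, clean))
```

A distance of 0 means identical distributions and 1 means disjoint support, so smaller values under input noise would indicate more stable trajectories.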

Load-bearing premise

The original backward attribution calculations can be recast as equilibria and trajectory distributions in the two-player game without distorting their mathematical properties or introducing artifacts that change explanation quality.

What would settle it

Two checks would settle it. First, if the game-derived rule, evaluated at the equilibrium claimed to recover alpha-beta-LRP, produced attribution maps that differ from the known alpha-beta-LRP formulas, the claim of faithful recovery would be falsified. Second, if the derived adaptation failed to improve localisation metrics on ViT-B/16 relative to prior rules, the claim of useful extension would be falsified.

Figures

Figures reproduced from arXiv: 2605.06212 by Georg Loho, Jakob Paul Zimmermann, Jim Berend, Sebastian Lapuschkin, Wojciech Samek.

Figure 1: We lift the backward pass through a network into a two-player game on an extended …
Figure 2: Hellinger trajectory distance under cascading parameter randomisation [Adebayo et al., …
Figure 3: Temperature sweeps aligned with Table 3(a) and (c) show the focus of attribution at lower …
Figure 4: Stopping Game: trajectory distribution and local stopping decisions on a toy subnetwork.
Figure 5: Routing Game: local routing subgame around …
Figure 6: Dense qualitative comparison on six ImageNet-S examples (ViT-B/16). Columns: Original; …
Figure 7: Input-noise Hellinger trajectory distance. Solid: plain …
Figure 8: Per-image standard deviation of the Hellinger trajectory distance vs. cascading random …

Original abstract

Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos' ReLU Net Game. Gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria, so attribution maps become projections of trajectory distributions rather than the primary object. Desired explanation properties, such as localisation focus, robustness to input noise, or stable attention routing, can be specified as game-theoretic concepts, including policy regularization, risk aversion, and extended action sets, and translate directly into novel adaptations of the well-known backward rules. On ViT-B/16, one such selected adaptation of alpha-beta-LRP outperforms prior transformer-specific backward methods across all considered localisation metrics.
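
To make "policy regularization" concrete, here is a hedged sketch of one standard form: maximising expected payoff minus a temperature-weighted KL penalty against a reference policy, whose maximiser is the Gibbs distribution pi(k) proportional to mu(k) * exp(r(k) / temperature). Whether the paper uses exactly this regulariser is an assumption; the payoffs, the reference policy, and the function names are hypothetical. The sketch shows the qualitative effect the abstract associates with localisation focus: lower temperature concentrates mass on the best action.

```python
import math


def gibbs_policy(payoffs, reference, temperature):
    """Maximiser of E_pi[payoff] - temperature * KL(pi || reference) on the simplex."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    top = max(payoffs)  # subtract the max for numerical stability
    weights = [mu * math.exp((r - top) / temperature)
               for r, mu in zip(payoffs, reference)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(policy):
    """Shannon entropy in nats; lower means more concentrated."""
    return -sum(p * math.log(p) for p in policy if p > 0.0)


# Hypothetical payoffs for three actions and a uniform reference policy.
payoffs = [1.0, 0.5, 0.0]
reference = [1.0 / 3.0] * 3

for temperature in (10.0, 1.0, 0.1):
    policy = gibbs_policy(payoffs, reference, temperature)
    print(temperature, [round(p, 4) for p in policy], round(entropy(policy), 4))
```

At high temperature the policy stays close to the reference; as the temperature falls it approaches the greedy choice, so the temperature acts as a single knob between diffuse and focused routing.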

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a game-theoretic framework for backward attribution by recasting it as a two-player game on an extended network graph, extending the ReLU Net Game. It claims that gradients and the full alpha-beta-LRP family arise exactly as integrals over game trajectories under specific equilibria, allowing desired properties (localisation, robustness) to be encoded as game concepts such as policy regularization or extended action sets. Novel adaptations of alpha-beta-LRP are derived and shown to outperform prior transformer-specific backward rules on ViT-B/16 across localisation metrics.

Significance. If the claimed exact equivalences hold without distortion for all layer types, the framework supplies a unifying lens that could systematize the design of attribution methods and translate explanation desiderata into game-theoretic primitives. The empirical gains on ViT-B/16 provide concrete evidence of utility for transformer interpretability. The absence of free parameters in the core mapping and the machine-checkable nature of the special-case recoveries (if supplied) would strengthen the contribution.

major comments (3)
  1. [§3] §3 (derivation of equilibria): The central claim that gradients and alpha-beta-LRP arise as integrals over trajectories requires explicit equilibrium conditions, payoff matrices, and a proof sketch showing that the original relevance-propagation rules (including alpha/beta weighting and gradient cases) are recovered exactly for every layer type, especially attention and softmax operations in ViT. Without these, it is impossible to verify that the trajectory measure introduces no implicit smoothing or artifacts.
  2. [§4.2] §4.2 (extended graph construction): The translation of standard backward rules into action sets and payoffs on the extended graph must be shown to be faithful; any layer-specific choice of equilibria risks non-uniqueness or distortion that would undermine the assertion that attribution maps are merely projections of trajectory distributions.
  3. [§5] §5 (empirical evaluation): The reported outperformance uses one selected adaptation of alpha-beta-LRP; the manuscript should clarify whether this adaptation was chosen after seeing the results and whether the full family of game-theoretic adaptations was evaluated to support the claim that the framework enables principled improvements.
minor comments (2)
  1. [Figure 1] The notation for game trajectories and their distributions would benefit from an explicit diagram relating the extended graph to the original network layers.
  2. [Abstract] Several sentences in the abstract and introduction repeat the unification claim without distinguishing between the theoretical mapping and the empirical adaptations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.

  1. Referee: [§3] §3 (derivation of equilibria): The central claim that gradients and alpha-beta-LRP arise as integrals over trajectories requires explicit equilibrium conditions, payoff matrices, and a proof sketch showing that the original relevance-propagation rules (including alpha/beta weighting and gradient cases) are recovered exactly for every layer type, especially attention and softmax operations in ViT. Without these, it is impossible to verify that the trajectory measure introduces no implicit smoothing or artifacts.

    Authors: We agree that greater explicitness will improve verifiability. In the revised manuscript we will expand §3 with the equilibrium conditions and payoff matrices for each layer type. We will also supply a proof sketch that recovers the original gradient and alpha-beta-LRP rules exactly, with dedicated treatment of attention and softmax layers in ViT, confirming that the trajectory integrals introduce no smoothing or other artifacts. revision: yes

  2. Referee: [§4.2] §4.2 (extended graph construction): The translation of standard backward rules into action sets and payoffs on the extended graph must be shown to be faithful; any layer-specific choice of equilibria risks non-uniqueness or distortion that would undermine the assertion that attribution maps are merely projections of trajectory distributions.

    Authors: We will revise §4.2 (and add an appendix if space is needed) to present the explicit translation of each standard backward rule into action sets and payoffs on the extended graph. The revision will specify the equilibrium selection rule per layer type and demonstrate that the resulting attribution maps are faithful projections of the trajectory distributions, thereby removing any ambiguity about non-uniqueness or distortion. revision: yes

  3. Referee: [§5] §5 (empirical evaluation): The reported outperformance uses one selected adaptation of alpha-beta-LRP; the manuscript should clarify whether this adaptation was chosen after seeing the results and whether the full family of game-theoretic adaptations was evaluated to support the claim that the framework enables principled improvements.

    Authors: We will clarify in the revised §5 that the reported adaptation was derived from the game-theoretic desiderata (policy regularization for localisation) before the experiments were run. We will also report results for the other adaptations considered under the framework, supporting the claim that the framework enables principled improvements rather than post-hoc selection. revision: partial

Circularity Check

0 steps flagged

No circularity: unification via external ReLU Net Game

The paper builds its framework explicitly on the external Gaubert and Vlassopoulos ReLU Net Game and presents gradients plus the alpha-beta-LRP family as special cases arising under chosen equilibria. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided abstract and description. The central claim is a recasting that treats attribution maps as projections of trajectory distributions; this remains an independent modeling choice rather than a reduction to the paper's own inputs by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The framework rests on the prior ReLU Net Game and on the assumption that backward passes correspond to game trajectories.

axioms (1)
  • [domain assumption] Backward attribution calculations can be represented as equilibria and trajectory distributions in a two-player game on an extended network graph.
    Central modeling step stated in the abstract; if false, the unification and new adaptations lose their foundation.
invented entities (2)
  • Extended network graph for the game (no independent evidence)
    purpose: To host the two-player game whose trajectories yield attribution maps.
    Introduced to recast backward passes; no independent evidence provided in the abstract.
  • Game trajectories and their distributions (no independent evidence)
    purpose: To serve as the underlying object from which attribution maps are projected.
    Core new object in the framework; no falsifiable handle given in the abstract.

pith-pipeline@v0.9.0 · 5482 in / 1409 out tokens · 54107 ms · 2026-05-08T13:22:51.938358+00:00 · methodology

