Playing the network backward: A Game Theoretic Attribution Framework
Pith reviewed 2026-05-08 13:22 UTC · model grok-4.3
The pith
Backward attribution methods arise as equilibria in a two-player game on the extended network graph, turning explanation design into strategy selection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Backward attribution calculations are equivalent to integrals over trajectories in a two-player game on the extended network graph. Gradients arise under one equilibrium while the alpha-beta-LRP family arises under others; the resulting attribution maps are projections of the trajectory distributions. Game concepts such as policy regularization and extended action sets translate into novel adaptations of the backward rules that preserve core properties while adding specified behaviors.
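As a rough illustration of what "integrals over trajectories" can mean in the simplest case, the sketch below checks the standard path decomposition of a ReLU network's gradient: the gradient equals the sum, over all input-to-output paths, of the product of edge weights, counting only paths whose ReLU gates are open. This is not the paper's game construction; the toy weights `W1`, `W2`, `W3` and all function names are hypothetical and chosen only for the example.

```python
from itertools import product

# Hypothetical toy network: 2 inputs -> 3 ReLU units -> 2 ReLU units -> 1 output.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
W2 = [[1.0, -1.0, 2.0], [0.5, 2.0, -1.0]]
W3 = [[2.0, -3.0]]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(x):
    """Return the scalar output and the two pre-activation vectors."""
    z1 = matvec(W1, x)
    a1 = [max(0.0, t) for t in z1]
    z2 = matvec(W2, a1)
    a2 = [max(0.0, t) for t in z2]
    return matvec(W3, a2)[0], z1, z2


def path_sum_gradient(x):
    """Sum weight products over all paths, keeping only open ReLU gates."""
    _, z1, z2 = forward(x)
    g1 = [1.0 if z > 0 else 0.0 for z in z1]
    g2 = [1.0 if z > 0 else 0.0 for z in z2]
    grad = [0.0] * len(x)
    for k, i, j in product(range(len(x)), range(len(z1)), range(len(z2))):
        grad[k] += W3[0][j] * g2[j] * W2[j][i] * g1[i] * W1[i][k]
    return grad


def finite_difference_gradient(x, eps=1e-6):
    """Central differences, used only as an independent check."""
    grad = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        grad.append((forward(xp)[0] - forward(xm)[0]) / (2 * eps))
    return grad


x = [1.0, 0.25]
print(path_sum_gradient(x))
print(finite_difference_gradient(x))
```

The two printed vectors should agree up to floating-point error as long as no pre-activation sits exactly on a ReLU kink, which is the case for the input chosen here.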
What carries the argument
The two-player game on the extended network graph, in which equilibria and trajectory distributions recover standard attribution rules and generate new ones.
If this is right
- Gradients and the full alpha-beta-LRP family are recovered as integrals over trajectories under specific equilibria.
- Attribution maps become projections of trajectory distributions rather than the primary object.
- Explanation properties such as localisation focus or stable attention routing are specified as game concepts and translated into new backward rules.
- A selected adaptation of alpha-beta-LRP outperforms prior transformer-specific methods across all considered localisation metrics on ViT-B/16.
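For readers unfamiliar with the rule family named in the points above, here is a minimal sketch of the commonly stated alpha-beta-LRP rule for a single bias-free linear layer. It is the standard textbook form, not the paper's game-derived adaptation; the function name, the toy inputs, and the choice to ignore biases are assumptions made for illustration. With alpha - beta = 1, relevance is conserved whenever every output neuron has both positive and negative contributions.

```python
def alpha_beta_lrp_linear(x, W, relevance_out, alpha=2.0, beta=1.0):
    """Propagate relevance through one linear layer (biases ignored).

    W[j][i] connects input i to output j. Positive and negative
    contributions z_ij = W[j][i] * x[i] are normalised separately and
    weighted by alpha and beta respectively.
    """
    relevance_in = [0.0] * len(x)
    for j, r_j in enumerate(relevance_out):
        z = [W[j][i] * x[i] for i in range(len(x))]
        pos = [max(t, 0.0) for t in z]
        neg = [min(t, 0.0) for t in z]
        pos_sum, neg_sum = sum(pos), sum(neg)
        for i in range(len(x)):
            if pos_sum > 0:
                relevance_in[i] += alpha * pos[i] / pos_sum * r_j
            if neg_sum < 0:
                relevance_in[i] -= beta * neg[i] / neg_sum * r_j
    return relevance_in


# Hypothetical toy layer: 3 inputs, 2 outputs.
x = [1.0, 2.0, -1.0]
W = [[1.0, -1.0, 0.5], [-2.0, 1.0, 1.0]]
relevance_out = [1.0, 3.0]

r_in = alpha_beta_lrp_linear(x, W, relevance_out, alpha=2.0, beta=1.0)
print(r_in, sum(r_in), sum(relevance_out))
```

Setting `alpha=1.0, beta=0.0` keeps only the positive contributions, which is one familiar member of the family.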
Where Pith is reading between the lines
- The game view could be used to combine multiple equilibria into hybrid attribution rules that trade off different properties.
- Testing whether varying risk aversion parameters improves explanation stability under input noise would be a direct next experiment.
- The framework suggests a route to import solution concepts from game theory to design attributions for architectures beyond vision transformers.
Load-bearing premise
The original backward attribution calculations can be recast as equilibria and trajectory distributions in the two-player game without distorting their mathematical properties or introducing artifacts that change explanation quality.
What would settle it
The claim of faithful recovery and useful extension would be falsified by a direct check showing either that a newly derived game adaptation of alpha-beta-LRP produces attribution maps mathematically inconsistent with the known alpha-beta-LRP formulas, or that it fails to improve localisation metrics on ViT-B/16 relative to prior rules.
Original abstract
Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos' ReLU Net Game. Gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria, so attribution maps become projections of trajectory distributions rather than the primary object. Desired explanation properties, such as localisation focus, robustness to input noise, or stable attention routing, can be specified as game-theoretic concepts, including policy regularization, risk aversion, and extended action sets, and translate directly into novel adaptations of the well-known backward rules. On ViT-B/16, one such selected adaptation of alpha-beta-LRP outperforms prior transformer-specific backward methods across all considered localisation metrics.
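To make "policy regularization" concrete, the sketch below shows one generic form it can take: maximising an expected log-payoff minus a KL penalty toward a reference policy, whose maximiser is the Gibbs policy proportional to reference times payoff. This is a standard result used here as an assumption about the general shape of such objectives; the paper's exact objective is not reproduced, and the names `gibbs_policy`, `objective`, and the toy numbers are hypothetical.

```python
import math


def gibbs_policy(reference, payoff):
    """Closed-form maximiser: pi_k proportional to reference_k * payoff_k."""
    weights = [m * v for m, v in zip(reference, payoff)]
    z = sum(weights)
    return [w / z for w in weights]


def objective(policy, reference, payoff):
    """Expected log-payoff minus KL(policy || reference)."""
    total = 0.0
    for p, m, v in zip(policy, reference, payoff):
        if p > 0:
            total += p * math.log(v) - p * math.log(p / m)
    return total


# Hypothetical reference policy (e.g. one attention row) and positive payoffs.
reference = [0.5, 0.3, 0.2]
payoff = [1.0, 2.0, 4.0]

pi_star = gibbs_policy(reference, payoff)
print(pi_star)
print(objective(pi_star, reference, payoff), math.log(sum(m * v for m, v in zip(reference, payoff))))
print(objective(reference, reference, payoff))
```

At the optimum the objective equals the log of the normaliser, and any other policy (such as the unmodified reference) scores lower, which is what the last two printed lines compare.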
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a game-theoretic framework for backward attribution by recasting it as a two-player game on an extended network graph, extending the ReLU Net Game. It claims that gradients and the full alpha-beta-LRP family arise exactly as integrals over game trajectories under specific equilibria, allowing desired properties (localisation, robustness) to be encoded as game concepts such as policy regularization or extended action sets. Novel adaptations of alpha-beta-LRP are derived and shown to outperform prior transformer-specific backward rules on ViT-B/16 across localisation metrics.
Significance. If the claimed exact equivalences hold without distortion for all layer types, the framework supplies a unifying lens that could systematize the design of attribution methods and translate explanation desiderata into game-theoretic primitives. The empirical gains on ViT-B/16 provide concrete evidence of utility for transformer interpretability. The absence of free parameters in the core mapping and the machine-checkable nature of the special-case recoveries (if supplied) would strengthen the contribution.
major comments (3)
- [§3] §3 (derivation of equilibria): The central claim that gradients and alpha-beta-LRP arise as integrals over trajectories requires explicit equilibrium conditions, payoff matrices, and a proof sketch showing that the original relevance-propagation rules (including alpha/beta weighting and gradient cases) are recovered exactly for every layer type, especially attention and softmax operations in ViT. Without these, it is impossible to verify that the trajectory measure introduces no implicit smoothing or artifacts.
- [§4.2] §4.2 (extended graph construction): The translation of standard backward rules into action sets and payoffs on the extended graph must be shown to be faithful; any layer-specific choice of equilibria risks non-uniqueness or distortion that would undermine the assertion that attribution maps are merely projections of trajectory distributions.
- [§5] §5 (empirical evaluation): The reported outperformance uses one selected adaptation of alpha-beta-LRP; the manuscript should clarify whether this adaptation was chosen after seeing the results and whether the full family of game-theoretic adaptations was evaluated to support the claim that the framework enables principled improvements.
minor comments (2)
- [Figure 1] The notation for game trajectories and their distributions would benefit from an explicit diagram relating the extended graph to the original network layers.
- [Abstract] Several sentences in the abstract and introduction repeat the unification claim without distinguishing between the theoretical mapping and the empirical adaptations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.
-
Referee: [§3] §3 (derivation of equilibria): The central claim that gradients and alpha-beta-LRP arise as integrals over trajectories requires explicit equilibrium conditions, payoff matrices, and a proof sketch showing that the original relevance-propagation rules (including alpha/beta weighting and gradient cases) are recovered exactly for every layer type, especially attention and softmax operations in ViT. Without these, it is impossible to verify that the trajectory measure introduces no implicit smoothing or artifacts.
Authors: We agree that greater explicitness will improve verifiability. In the revised manuscript we will expand §3 with the equilibrium conditions and payoff matrices for each layer type. We will also supply a proof sketch that recovers the original gradient and alpha-beta-LRP rules exactly, with dedicated treatment of attention and softmax layers in ViT, confirming that the trajectory integrals introduce no smoothing or other artifacts.
Revision: yes
-
Referee: [§4.2] §4.2 (extended graph construction): The translation of standard backward rules into action sets and payoffs on the extended graph must be shown to be faithful; any layer-specific choice of equilibria risks non-uniqueness or distortion that would undermine the assertion that attribution maps are merely projections of trajectory distributions.
Authors: We will revise §4.2 (and add an appendix if space is needed) to present the explicit translation of each standard backward rule into action sets and payoffs on the extended graph. The revision will specify the equilibrium selection rule per layer type and demonstrate that the resulting attribution maps are faithful projections of the trajectory distributions, thereby removing any ambiguity about non-uniqueness or distortion.
Revision: yes
-
Referee: [§5] §5 (empirical evaluation): The reported outperformance uses one selected adaptation of alpha-beta-LRP; the manuscript should clarify whether this adaptation was chosen after seeing the results and whether the full family of game-theoretic adaptations was evaluated to support the claim that the framework enables principled improvements.
Authors: We will clarify in the revised §5 that the reported adaptation was derived from the game-theoretic desiderata (policy regularization for localisation) before the experiments were run. We will also report results for the other adaptations considered under the framework, thereby supporting the claim that the framework enables principled improvements rather than post-hoc selection.
Revision: partial
Circularity Check
No circularity: unification via external ReLU Net Game
The paper builds its framework explicitly on the external Gaubert and Vlassopoulos ReLU Net Game and presents gradients plus the alpha-beta-LRP family as special cases arising under chosen equilibria. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided abstract and description. The central claim is a recasting that treats attribution maps as projections of trajectory distributions; this remains an independent modeling choice rather than a reduction to the paper's own inputs by construction. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Backward attribution calculations can be represented as equilibria and trajectory distributions in a two-player game on an extended network graph.
invented entities (2)
- Extended network graph for the game (no independent evidence)
- Game trajectories and their distributions (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Quantifying Attention Flow in Transformers
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. URL https://aclanthology.org/2020.acl-main.385/. Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Aakriti Jain, Thomas Wiegand, Sebastian Lapuschkin, and Wojciech Samek. AttnLRP: Attention-aware layer-wise relevance propagation for transformers. In Ruslan S...
-
[2]
Finite-time analysis of the multiarmed bandit problem
URL https://openreview.net/forum?id=B1J_rgWRW. Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002. doi: https://doi.org/10.1023/A:1013689704352. Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On...
-
[3]
doi: https://doi.org/10.1016/j.patcog.2021.108194. URL https://www.sciencedirect.com/science/article/pii/S0031320321003769. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 ...
-
[4]
Attribution Patching Outperforms Automated Circuit Discovery
URL http://arxiv.org/abs/1409.1556. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6034. Leon Sixt, Maximilian Granz, and Tim Landgraf. When explanati...
-
[5]
with $P'_{l-1} = j$; the prepended edge uses $W^{(l)}_{ij}$, contributing $|W^{(l)}_{ij}|$ to $|w|(P)$ and either preserving or flipping parity depending on $\operatorname{sgn}(W^{(l)}_{ij})$. Crucially, the new gate factor $G_{<l}(P)$ for $P$ equals $G_{\le l-1}(P')$: up to layer $l-1$ the gates are those along $P'$, and there is no gate at layer $l$ in the $G_{<l}$ product. Case $W^{(l)}_{ij} > 0$. Then $W^{(l,+)}_{ij} =$ |W (...
-
[6]
Separate player-specific payoffs. The original formulation tracks only one game value, which equals $\pm a^{(l)}_i$ of the original network. We additionally maintain the non-negative payoff per player, and for each state its expectation over future paths. The original game value is then recovered as the difference of these two player-specific quantities.
-
[7]
Terminal SG values split into the positive and negative input parts. In the original formulation the game values at the input layer are the signed scalars $\pm x_k$. We replace these by the non-negative pair $x^+_k = \max(x_k, 0)$ and $x^-_k = \max(-x_k, 0)$, which in the SG keeps every player-specific payoff non-negative. This exposes the parity trajectory decomposition...
-
[8]
Oracle with $\gamma_O = 2$ is not forced by the architecture; it is the simplest choice inside a one-parameter family of conservation-preserving Oracles $(p, 1-p)$ with matching discounts $(1/p, 1/(1-p))$. Any such split preserves the forward equivalence of Proposition 1, since the constraint the forward pass imposes is that the player-specific value at an addition ...
-
[9]
Network Activation Gradient. The ordinary-network gradient of $a^{(m)}_j$ with respect to the scalar activation $a^{(l)}_i$ is
$$\frac{\partial a^{(m)}_j}{\partial a^{(l)}_i} = \xi_{q,+}\,\Gamma_u\bigl(s^{(l,\mathrm{act})}_{i,+}\bigr) - \Gamma_u\bigl(s^{(l,\mathrm{act})}_{i,-}\bigr). \tag{74}$$
Proof. We establish Part 1 by backward induction on the layer gap $m - l$ and derive Part 2 from it. Part 2. By Theorem 5, $a^{(l)}_i = a^{(l,+)}_{\mathrm{stop},i} - a^{(l,-)}_{\mathrm{stop},i}$ and $a^{(m)}_j =$ a (m,+) s...
-
[10]
for a textbook proof. As in Section 3.2, write
$$\ell(z) := \begin{cases} \log z, & z > 0, \\ -\infty, & z = 0, \\ \omega, & z < 0, \end{cases} \qquad \omega < -\infty < 0. \tag{94}$$
Thus, zero routed mass is assigned the stopping value $-\infty$, while genuinely negative mass is assigned the strictly worse formal value $\omega$. Moreover, we define the exponential function to evaluate to 0 on both $-\infty$ and $\omega$:
$$\exp(\omega) := \exp(-\infty) := 0. \tag{95}$$
We remark that...
-
[11]
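Residual Addition. Let $z = x_{\mathrm{op}} + y_{\mathrm{op}}$ be an addition node with addition state $s^{\mathrm{add}}_{z,p}$ and operand states $s_{x_{\mathrm{op}},p}$, $s_{y_{\mathrm{op}},p}$ in the notation of Definition 10. For every player label $p \in \{+,-\}$,
$$\Gamma_x\bigl(s_{x_{\mathrm{op}},p}\bigr) = \tfrac{1}{2}\,\Gamma_x\bigl(s^{\mathrm{add}}_{z,p}\bigr), \qquad \Gamma_x\bigl(s_{y_{\mathrm{op}},p}\bigr) = \tfrac{1}{2}\,\Gamma_x\bigl(s^{\mathrm{add}}_{z,p}\bigr), \tag{133}$$
so the operand pair carries the full addition-state mass with no duplication.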
-
[12]
Max Pooling. For a pooled output $z = \max\{x_1, \ldots, x_m\}$ with winner $k^\star$ and pooling state $s^{\max}_{z,p}$ (Definition 11), the deterministic value-maximising transition concentrates all mass on the winner: for every player label $p \in \{+,-\}$,
$$\Gamma_x\bigl(s_{x_{k^\star},p}\bigr) = \Gamma_x\bigl(s^{\max}_{z,p}\bigr), \qquad \Gamma_x\bigl(s_{x_r,p}\bigr) = 0 \ \text{ for } r \neq k^\star. \tag{134}$$
Proof. We prove $R^{(L)}_u \cdot \Gamma^{(l)}_j = R^{(l)}_j$ at every layer $l$ by backward indu...
-
[13]
Sign-oracle split (output activation → output sign-branch, fixed $(q, d)$). At $s^{(\mathrm{att},O,\mathrm{act})}_{q,d,p}$ an unobserved Oracle transitions uniformly to $s^{(\mathrm{att},O,\mathrm{lin})}_{(q,d),p,+}$ (player $p$ retains the turn, trajectory discount $2\alpha$) or to $s^{(\mathrm{att},O,\mathrm{lin})}_{(q,d),p',-}$ (turn switches to opponent $p'$, discount $2\beta$), each with probability $1/2$.
-
[14]
The feature index $d$ is preserved.
-
[15]
Value-routing policy (output sign-branch → V-projection linear). At $s^{(\mathrm{att},O,\mathrm{lin})}_{(q,d),p,\sigma}$ the active player picks a key token $k$ by the mixed action
$$\pi^\star_{q,d,\sigma}(k) = \frac{A_{q,k}\,\tilde v^{\sigma}_{k,d}}{Z^{\sigma}_{q,d}}, \tag{148}$$
derived in §E.1.4 as the equilibrium of a KL-regularised log-payoff problem against the reference $\mu_q = A_{q,\cdot}$. The trajectory transitions to $s^{(\mathrm{att},V,\mathrm{lin})}_{(k,d),p,\sigma}$ with the...
-
[16]
V-projection routing (V-projection linear → input activation). At $s^{(\mathrm{att},V,\mathrm{lin})}_{(k,d),p,\sigma}$ the active player picks an input dim $e$ by the standard linear-state Gibbs policy of Definition 9 on the $\sigma$-stream weights $W^{\sigma}_{V,e,d}$:
$$\pi^\star_{V,k,d,\sigma}(e) = \frac{W^{\sigma}_{V,e,d}\,X_{k,e}}{\tilde v^{\sigma}_{k,d}}. \tag{149}$$
The trajectory transitions to $s^{(\mathrm{att},X,\mathrm{act})}_{(k,e),p}$ with the player label preserved and traject...
-
[17]
= 0), tapering to 0 at the boundary $\pi \in \{0,1\}$. It rewards indecisiveness, exactly the role Shannon entropy $H(\pi)$ plays in the Softplus variant of §C.3, where the entropy bonus is the active player's surplus from being allowed to mix. The optimum $\pi^\star = \Phi(z) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,1)}\bigl[\mathbf{1}(z + \varepsilon > 0)\bigr]$ is the hard ReLU gate averaged over a Gaussian shift of its threshold, alignin...
-
[18]
Mode-selection grid on the custom 50-image validation split. For every method we sweep the per-method ranges in Table 5. The single configuration per method reported in Tables 1 and 2 is the validation winner under the localisation rank-sum criterion below. The larger quantitative appendix tables (Appendix G.2) report the retained top configurations from t...
-
[19]
on the main-paper attribution methods; §J.2 carries the same protocol over to the trajectory-space Hellinger diagnostic of Appendix I; §J.4 compares the two per image.
J.1 Heatmap similarity at full randomisation
We apply the cascading parameter randomization test of Adebayo et al. [2018] to all attribution methods evaluated in Section 6. Starting from...
-
[20]
into (265) gives $d = 0.1$, $\kappa = 0.5 \cdot 0.9 = 0.45$, and asymptote $c_\infty \approx 0.18$, hence $H_\infty \approx \sqrt{1 - 0.18} \approx 0.91$, within the same order of magnitude as the empirically observed $H \approx 0.96$. The remaining discrepancy reflects correlation between the pretrained and randomized dead masks (not truly independent), concentration of $\beta^{(l)}_{\mathrm{ord}}$ away from deadA (which lowers $d$ relative t...
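The arithmetic in the last excerpt can be re-checked in a few lines. This is only a sanity check of the quoted numbers: equation (265) is not reproduced on this page, so the asymptote 0.18 is taken as given rather than recomputed, and the variable names are hypothetical.

```python
import math

d = 0.1            # value quoted in the excerpt
kappa = 0.5 * 0.9  # product exactly as written in the excerpt
c_inf = 0.18       # asymptote quoted in the excerpt; (265) itself is not available here

# Predicted asymptotic Hellinger value and its gap to the observed 0.96.
h_inf = math.sqrt(1.0 - c_inf)
gap_to_observed = 0.96 - h_inf

print(round(kappa, 2), round(h_inf, 2), round(gap_to_observed, 2))
```

The rounded values reproduce the figures in the text: 0.45 for the product and about 0.91 for the predicted asymptote, leaving a gap of roughly 0.05 to the observed value.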