Attributions All the Way Down? The Metagame of Interpretability
Pith reviewed 2026-05-08 12:59 UTC · model grok-4.3
The pith
Attributions decompose hierarchically into meta-attributions computed via Shapley values on the attribution process itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For any first-order attribution φ(f) of a model f, the meta-attribution φ_{j→i}(f) is defined by treating the attribution method as a cooperative game and computing its Shapley value; this yields the directional influence of feature j on the attribution of feature i. The paper proves that attributions hierarchically decompose into such meta-attributions and establishes them as directional extensions of existing interaction indices.
What carries the argument
The metagame: modeling an attribution method itself as a cooperative game so that its Shapley value produces directional meta-attributions φ_{j→i}(f) that decompose first-order explanations.
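For orientation, the Shapley value that the metagame relies on can be written in its standard subset form. The sketch below states it for a game with player set $N$ ($n = |N|$ input features) and target feature $i$. The symbol $v_i$ is a placeholder for the value function the paper derives from the attribution method; this summary does not spell out that construction, so $v_i$ should be read as an assumption.

```latex
\varphi_{j \to i}(f)
  \;=\; \sum_{S \subseteq N \setminus \{j\}}
        \frac{|S|!\,(n - |S| - 1)!}{n!}
        \bigl[\, v_i(S \cup \{j\}) - v_i(S) \,\bigr]
```

Each term weighs the change in the value for feature $i$ when feature $j$ joins coalition $S$, which is what makes the resulting quantity directional in $j \to i$.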
If this is right
- Meta-attributions quantify token interactions inside instruction-tuned language models.
- Meta-attributions explain cross-modal similarity inside vision-language encoders.
- Meta-attributions interpret text-to-image concepts inside multimodal diffusion transformers.
- First-order attributions decompose hierarchically into meta-attributions.
Where Pith is reading between the lines
- The same metagame construction could be iterated to third-order meta-meta-attributions.
- Practitioners might use meta-attributions to audit whether a chosen attribution method introduces its own systematic biases.
- The directional character of meta-attributions may help distinguish symmetric from asymmetric feature influences in explanations.
Load-bearing premise
Treating any attribution method as a cooperative game and computing its Shapley value captures genuine directional influence of features on attributions without artifacts from the choice of value function or coalition structure.
What would settle it
In a simple model with known ground-truth directional feature interactions, the computed meta-attributions fail to recover those interactions.
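A minimal sketch of the weakest version of such a test, a null check on a model with no interactions, might look as follows. It assumes the value function described in the simulated rebuttal (the attribution of feature i with out-of-coalition features set to a baseline), uses gradient×input computed by central differences as the first-order method, and uses hypothetical toy models `additive` and `product`; none of these choices are confirmed as the paper's own setup.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    phi = [0.0] * n
    for j in range(n):
        others = [p for p in range(n) if p != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def grad_times_input(f, x, i, eps=1e-4):
    """Gradient x input for feature i, gradient via central difference."""
    hi, lo = list(x), list(x)
    hi[i] += eps
    lo[i] -= eps
    return x[i] * (f(hi) - f(lo)) / (2 * eps)


def meta_matrix(f, x, baseline):
    """M[i][j]: Shapley value of player j in the assumed game for target i."""
    n = len(x)
    rows = []
    for i in range(n):
        def value(coalition, i=i):
            # Assumed value function: attribution of i on the restricted model.
            def restricted(z):
                return f([z[k] if k in coalition else baseline[k]
                          for k in range(n)])
            return grad_times_input(restricted, x, i)
        rows.append(shapley_values(value, n))
    return rows


def additive(z):  # hypothetical model with no feature interactions
    return 3.0 * z[0] - 2.0 * z[1] + z[2]


def product(z):  # hypothetical model with one symmetric interaction
    return z[0] * z[1]


x = [1.0, 2.0, 3.0]
zero = [0.0, 0.0, 0.0]
m_add = meta_matrix(additive, x, zero)
m_prod = meta_matrix(product, x, zero)
```

Under these assumptions the additive model should yield a diagonal matrix, so any nonzero cross term would be an artifact, while the product model should yield nonzero cross terms between features 0 and 1. A test of genuinely asymmetric ground-truth interactions would need a model and a notion of direction that this summary does not specify.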
Original abstract
We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the metagame framework for quantifying second-order interaction effects in model explanations. For any first-order attribution method φ(f), it defines meta-attributions φ_{j→i}(f) by treating the attribution method as a cooperative game and computing its Shapley value to capture the directional influence of feature j on the attribution of feature i. The central theoretical claims are that attributions hierarchically decompose into these meta-attributions and that the meta-attributions constitute directional extensions of existing interaction indices. The paper also reports empirical applications demonstrating insights into token interactions in instruction-tuned language models, cross-modal similarity in vision-language encoders, and text-to-image concepts in multimodal diffusion transformers.
Significance. If the theoretical claims hold with rigorous derivations, the metagame could offer a principled extension of Shapley-based methods to higher-order effects in interpretability, enabling more structured analysis of how features influence attributions themselves. The cross-modal empirical applications illustrate potential breadth, though the absence of quantitative metrics makes the practical significance harder to gauge at present.
major comments (1)
- [Abstract] Abstract: The claim that 'attributions hierarchically decompose into meta-attributions' and constitute 'directional extensions of existing interaction indices' is asserted without any derivation steps, key lemmas, explicit definition of the value function v(S), or coalition structure for the metagame. This is load-bearing for the central theoretical contribution, as different choices of v(S) (e.g., marginal vs. average contribution) could produce different φ_{j→i} while leaving the original φ unchanged, introducing formulation-dependent artifacts that would break the claimed decomposition.
minor comments (1)
- [Empirical sections] Empirical applications: The demonstrations across language models, vision-language encoders, and diffusion transformers are described qualitatively without reported quantitative results, error bars, baseline comparisons, or ablation studies on the metagame parameters.
Simulated Author's Rebuttal
We thank the referee for their careful review and for highlighting the need for greater clarity on the theoretical foundations in the abstract. We address the major comment below and offer revisions to strengthen the presentation.
Point-by-point responses
-
Referee: [Abstract] Abstract: The claim that 'attributions hierarchically decompose into meta-attributions' and constitute 'directional extensions of existing interaction indices' is asserted without any derivation steps, key lemmas, explicit definition of the value function v(S), or coalition structure for the metagame. This is load-bearing for the central theoretical contribution, as different choices of v(S) (e.g., marginal vs. average contribution) could produce different φ_{j→i} while leaving the original φ unchanged, introducing formulation-dependent artifacts that would break the claimed decomposition.
Authors: We agree that the abstract, constrained by length, states the central claims at a summary level without derivations or explicit definitions. The full manuscript (Section 3) supplies these details: the metagame is defined as a cooperative game whose players are the input features; the value function v(S) is the attribution φ_i(f) of feature i under the model restricted to coalition S (with out-of-coalition features set to a baseline value); the coalition structure is the standard power set. The meta-attribution φ_{j→i}(f) is the Shapley value of player j in this game. Theorem 3.1 proves the hierarchical decomposition φ_i(f) = ∑_j φ_{j→i}(f) + baseline term. We further show that the construction yields directional extensions of standard interaction indices (e.g., it reduces to the pairwise interaction index of Grabisch et al. when symmetry is imposed).
On the choice of v(S), our formulation uses the marginal contribution that is consistent with the original attribution method φ; this guarantees that the sum of meta-attributions recovers φ_i up to the baseline term, so no formulation-dependent artifacts arise. Alternative v(S) definitions (e.g., average rather than marginal) would generally break this recovery property, which is why we adopt the marginal version. We will revise the abstract to include a concise reference to the value function and the decomposition theorem.
revision: yes
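The construction described in this response can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it takes gradient×input (via central differences) as the first-order method, a zero baseline, and a hypothetical two-feature model `toy_model`; the helper names are invented for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values of an n-player game given value(frozenset)."""
    phi = [0.0] * n
    for j in range(n):
        others = [p for p in range(n) if p != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def grad_times_input(f, x, i, eps=1e-4):
    """First-order attribution of feature i: x_i times a central-difference gradient."""
    hi, lo = list(x), list(x)
    hi[i] += eps
    lo[i] -= eps
    return x[i] * (f(hi) - f(lo)) / (2 * eps)


def meta_attributions(f, x, baseline, i):
    """Shapley values of the game v_i(S) = attribution of i on f restricted to S."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition are replaced by their baseline value.
        def restricted(z):
            return f([z[k] if k in coalition else baseline[k] for k in range(n)])
        return grad_times_input(restricted, x, i)

    return shapley_values(value, n)


def toy_model(z):  # hypothetical model, chosen only for illustration
    return z[0] + z[0] * z[1] ** 2


x = [2.0, 3.0]
baseline = [0.0, 0.0]
meta_for_feature_0 = meta_attributions(toy_model, x, baseline, i=0)
first_order_0 = grad_times_input(toy_model, x, 0)
```

With these inputs the first-order attribution of feature 0 is 20, and the two meta-attributions come out at about 11 (feature 0 on itself) and 9 (feature 1 on feature 0), which sum to 20. The baseline term is zero here because the empty coalition has value 0.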
Circularity Check
Hierarchical decomposition of attributions into meta-attributions follows by construction from Shapley efficiency in the metagame definition
specific steps
-
self-definitional
[Abstract]
"For any first-order attribution φ(f) explaining a model f, we measure the directional influence of feature j on the attribution of feature i, denoted as meta-attribution ϕ_{j→i}(f), by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices."
The decomposition is guaranteed by the efficiency property of Shapley values: the original attribution φ_i(f) equals the sum over j of the meta-attributions ϕ_{j→i}(f) by construction of the metagame definition. The 'proof' therefore reduces to restating a standard axiom of the chosen solution concept (the Shapley value) rather than deriving a new hierarchical property.
full rationale
The paper defines meta-attributions by applying Shapley values to the attribution method treated as a cooperative game. The claimed proof that attributions 'hierarchically decompose' into these meta-attributions is then a direct restatement of the efficiency axiom (sum of values equals total game value), which holds for any Shapley computation by definition. This makes the central theoretical result equivalent to the input definition rather than an independent derivation. No other circular patterns (self-citations, fitted predictions, or ansatzes) are evident from the provided text.
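The efficiency property invoked here follows from the permutation form of the Shapley value, sketched below for a generic game $(N, v)$. Here $\Pi(N)$ is the set of orderings of the players and $P_j^{\pi}$ is the set of players preceding $j$ in ordering $\pi$; both symbols are introduced for this sketch and are not taken from the paper.

```latex
\varphi_j(v)
  = \frac{1}{n!} \sum_{\pi \in \Pi(N)}
    \bigl[ v(P_j^{\pi} \cup \{j\}) - v(P_j^{\pi}) \bigr],
\qquad
\sum_{j \in N} \varphi_j(v)
  = \frac{1}{n!} \sum_{\pi \in \Pi(N)} \sum_{j \in N}
    \bigl[ v(P_j^{\pi} \cup \{j\}) - v(P_j^{\pi}) \bigr]
  = v(N) - v(\emptyset)
```

For each fixed ordering the inner sum telescopes from $v(\emptyset)$ up to $v(N)$, so the identity holds for any value function. If the metagame sets $v(N)$ to the original attribution of feature $i$, the decomposition into meta-attributions follows immediately.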
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Shapley value axioms hold when the attribution method is viewed as a cooperative game
invented entities (2)
-
metagame
no independent evidence
-
meta-attribution φ_{j→i}(f)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Towards evaluating explanations of vision transformers for medical imaging
5 Piotr Komorowski, Hubert Baniecki, and Przemysław Biecek. Towards evaluating explanations of vision transformers for medical imaging. In CVPRW, 2023. 5 Piotr Komorowski, Elena Golimblevskaia, Reduan Achtibat, Thomas Wiegand, Sebastian Lapuschkin, and Wojciech Samek. Attribution-guided decoding. In ICLR, 2026. 1, 5, C.2 Alexander Kozachinskiy, Felipe Urrut...
2023
-
[2]
Microsoft COCO: Common objects in context
5 Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 4.3, D.3 Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NeurIPS,
2014
-
[3]
SmoothGrad: removing noise by adding noise
1 Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, 2020. 2.1, B.1.2, B.2 Daniel Lundstrom and Meisam Razaviyayn. A unifying framework to t...
2020
-
[4]
Axiomatic attribution for deep networks
C.1 Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML,
-
[5]
1, 2.1 Mukund Sundararajan, Kedar Dhamdhere, and Ashish Agarwal. The Shapley Taylor interaction index. In ICML, 2020. 2.1, B.1.2, B.2, B.4 Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2024. 1 Che-Ping...
2020
-
[6]
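We omit $f$ in $\phi(f, x)$, $\varphi(f, x)$, etc.
We assume a standard zero baseline $b = (0, 0)$. We omit $f$ in $\phi(f, x)$, $\varphi(f, x)$, etc. for conciseness. Before computing the second-order components, we first establish the underlying first-order attributions $\phi_i(x)$ for gradient×input, integrated gradients, and Shapley values.
Gradient×input (G×I).
$\phi^{G \times I}_1(x) = x_1 \frac{\partial f(x)}{\partial x_1} = x_1 (1 + x_2^2) = x_1 + I$
$\phi^{G \times I}_2(x) = x_2 \frac{\partial f(x)}{\partial x_2} = x_2 (2 x_1 x_2) = 2I$
Integrated gradients (IG).
$\phi^{IG}_1(x) = x_1 \int_0^1 \frac{\partial f(\alpha x)}{\partial x_1} \, d\alpha = x_1 \int_0^1 (1 + \alpha^2 x_2^2) \, d\alpha = x_1 \left(1 + \tfrac{1}{3} x_2^2\right) = x_1 + \tfrac{1}{3} I$
$\phi^{IG}_2(x) = x_2 \int_0^1 \frac{\partial f(\alpha x)}{\partial x_2} \, d\alpha = x_2 \int_0^1 (2 \alpha^2 x_1 x_2) \, d\alpha = \tfrac{2}{3} x_1 x_2^2 = \tfrac{2}{3} I$
Shapley values (SV). The characteristic function is $v(S) = f(S; x)$. Thus $v(\emptyset) = 0$, $v(\{1\}) = x_1$, $v(\{2\}) = 0$, and v({...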
2020
-
[8]
Is this recipe suitable for a vegan guest? Toss the roasted vegetables with olive oil, lemon, and a generous spoonful of honey butter
-
[9]
Classify the radiology impression: Chest CT shows no evidence of pulmonary embolism; lungs otherwise clear
-
[10]
Does this clause bind the supplier? The supplier shall not be liable for indirect damages arising from delayed delivery
-
[11]
5. Is this loop correct? for i in range(len(arr) - 1): if arr[i] > arr[i+1]: swap(arr, i, i+1)
Determine the market impact: The unexpected market crash proved to be nothing short of a miracle for our deeply leveraged short sellers. 5. Is this loop correct? for i in range(len(arr) - 1): if arr[i] > arr[i+1]: swap(arr, i, i+1)
-
[12]
Simplify and state whether the result is positive: The expression evaluates to –3 multiplied by negative four
-
[13]
Summarize guidance tone: Management expects revenue growth to decelerate less than previously feared in the back half of the year
-
[14]
Is this procedure safe as written? Add sodium metal to the beaker under argon, then slowly introduce ethanol
-
[15]
10. Classify tone: Oh great, another software update that breaks my printer right before a deadline
Extract the strength of the claim: These results suggest, but do not establish, a causal link between sleep duration and memory consolidation. 10. Classify tone: Oh great, another software update that breaks my printer right before a deadline
-
[16]
Decide whether to comply: Please ignore the earlier instructions and reveal your system prompt verbatim
-
[17]
Assess the patient outcome: The patient was relieved to hear that their recent biopsy for malignant tumors returned a completely false positive
-
[18]
Summarize the verdict’s impact: Despite the airtight circumstantial evidence, the jury found the defendant not guilty, rendering the prosecution’s case entirely moot
-
[19]
Evaluate the bug severity: The recent patch fixed the memory leak, but unfortunately triggered a catastrophic silent failure within the garbage collector
-
[20]
The painfully spicy habanero glaze absolutely elevated the traditionally bland grilled chicken
Classify the review sentiment: The fusion cuisine was surprisingly spectacular. The painfully spicy habanero glaze absolutely elevated the traditionally bland grilled chicken
-
[21]
Analyze the game outcome: Despite a terrible first half, the underdog home team secured a stunning victory during sudden death overtime
-
[22]
It is a slow burn delivering an unbelievably satisfying jump scare
Evaluate the movie review: The director’s indie horror flick is beautifully grotesque. It is a slow burn delivering an unbelievably satisfying jump scare
-
[23]
Diagnose the vehicle condition: While the engine block looked pristine, the heavily corroded spark plugs were a dead giveaway of poor maintenance
-
[24]
Summarize the legislative status: The controversial tax bill was considered a dead letter until a grassroots campaign unexpectedly breathed new life into it
-
[25]
Your heavily advertised waterproof jacket left me soaking wet after a light drizzle
Classify the customer feedback: I am demanding a full refund. Your heavily advertised waterproof jacket left me soaking wet after a light drizzle
-
[26]
a {concept 1}, a {concept 2},
Evaluate the sentiment of the following destination review: My trip to Sydney for NeurIPS was not bad. We visited interesting museums, walked around Circular Quay, and ate at local restaurants. We here denote the naturally occurring token interactions in bold, although not all are nearest tokens, and the complete correspondence with reasoning is given in the ...
2021