arXiv: 2605.06295 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · stat.ML


Attributions All the Way Down? The Metagame of Interpretability

Hubert Baniecki, Przemyslaw Biecek, Fabian Fumagalli


Pith reviewed 2026-05-08 12:59 UTC · model grok-4.3


keywords: model interpretability · meta-attributions · Shapley values · interaction indices · cooperative games · second-order effects · explainable AI · hierarchical decomposition


The pith

Attributions decompose hierarchically into meta-attributions computed via Shapley values on the attribution process itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the metagame to quantify second-order effects within model explanations. It treats any first-order attribution method as a cooperative game whose Shapley value yields a directional meta-attribution measuring how feature j influences the importance assigned to feature i. The central theoretical result is that attributions decompose into these meta-attributions, which serve as directional extensions of standard interaction indices. A reader would care because the same machinery is then applied to token interactions in language models, cross-modal alignments in vision-language models, and concept formation in diffusion transformers.

Core claim

For any first-order attribution φ(f) of a model f, the meta-attribution φ_{j→i}(f) is defined by treating the attribution method as a cooperative game and computing its Shapley value; this yields the directional influence of feature j on the attribution of feature i. The paper proves that attributions hierarchically decompose into such meta-attributions and establishes them as directional extensions of existing interaction indices.

What carries the argument

The metagame: modeling an attribution method itself as a cooperative game so that its Shapley value produces directional meta-attributions φ_{j→i}(f) that decompose first-order explanations.
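A minimal Python sketch of this construction, under stated assumptions: the coalition value is taken to be the attribution of feature i on an input where out-of-coalition features are replaced by a baseline (the value function described in the simulated rebuttal on this page), gradient×input stands in for the first-order method, and the toy model `f(x) = x1 + x1 * x2**2`, the zero baseline, and all function names are illustrative choices rather than the paper's code.

```python
from itertools import combinations
from math import factorial


def shapley(n, value):
    """Exact Shapley values of an n-player game; `value` maps a frozenset of players to a float."""
    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[j] += weight * (value(coalition | {j}) - value(coalition))
    return phi


def grad_x_input(x):
    """Gradient x input for the toy model f(x) = x1 + x1 * x2**2 (analytic gradient)."""
    x1, x2 = x
    grad = (1.0 + x2 ** 2, 2.0 * x1 * x2)
    return [xi * gi for xi, gi in zip(x, grad)]


def mask(x, coalition, baseline):
    """Keep features in the coalition, replace the rest by the baseline (assumed masking scheme)."""
    return [x[k] if k in coalition else baseline[k] for k in range(len(x))]


def meta_attributions(x, baseline, attribution):
    """meta[i][j]: Shapley value of feature j in the game whose payoff is the attribution of feature i."""
    n = len(x)
    meta = []
    for i in range(n):
        def value(coalition, i=i):
            return attribution(mask(x, coalition, baseline))[i]
        meta.append(shapley(n, value))
    return meta


x, baseline = [2.0, 3.0], [0.0, 0.0]
meta = meta_attributions(x, baseline, grad_x_input)
first_order = grad_x_input(x)
for i, row in enumerate(meta):
    print(f"feature {i}: attribution={first_order[i]:.1f}, meta-attributions={row}")
```

In this toy case each row sums to the corresponding first-order attribution because the attribution at the zero baseline is zero, and the two off-diagonal entries differ (9 versus 18), which is the kind of asymmetry a directional quantity can express.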

If this is right

Meta-attributions quantify token interactions inside instruction-tuned language models.
Meta-attributions explain cross-modal similarity inside vision-language encoders.
Meta-attributions interpret text-to-image concepts inside multimodal diffusion transformers.
First-order attributions decompose hierarchically into meta-attributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same metagame construction could be iterated to third-order meta-meta-attributions.
Practitioners might use meta-attributions to audit whether a chosen attribution method introduces its own systematic biases.
The directional character of meta-attributions may help distinguish symmetric from asymmetric feature influences in explanations.

Load-bearing premise

Treating any attribution method as a cooperative game and computing its Shapley value captures genuine directional influence of features on attributions without artifacts from the choice of value function or coalition structure.

What would settle it

In a simple model with known ground-truth directional feature interactions, the computed meta-attributions fail to recover those interactions.
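One hedged way to set up such a check is sketched below. It assumes the coalition-masking value function described in the simulated rebuttal on this page, uses gradient×input with central finite differences as the first-order method, and takes "ground truth" to mean only the presence or absence of an interaction term in two hypothetical toy models; none of these choices come from the paper.

```python
from itertools import combinations
from math import factorial


def shapley(n, value):
    """Exact Shapley values of an n-player game; `value` maps a frozenset of players to a float."""
    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[j] += weight * (value(coalition | {j}) - value(coalition))
    return phi


def grad_x_input(f, x, eps=1e-5):
    """Gradient x input with central finite differences (illustrative first-order method)."""
    attrs = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        attrs.append(x[i] * (f(hi) - f(lo)) / (2 * eps))
    return attrs


def meta_attributions(f, x, baseline):
    """meta[i][j]: Shapley value of feature j in the game whose payoff is the attribution of feature i."""
    n = len(x)
    meta = []
    for i in range(n):
        def value(coalition, i=i):
            masked = [x[k] if k in coalition else baseline[k] for k in range(n)]
            return grad_x_input(f, masked)[i]
        meta.append(shapley(n, value))
    return meta


def max_off_diagonal(meta):
    """Largest absolute cross-feature meta-attribution (j != i)."""
    n = len(meta)
    return max(abs(meta[i][j]) for i in range(n) for j in range(n) if i != j)


def additive(x):
    # Hypothetical model with no interactions by construction.
    return 1.0 * x[0] + 2.0 * x[1] + 3.0 * x[2]


def interacting(x):
    # Hypothetical model where features 0 and 1 interact and feature 2 is independent.
    return x[0] * x[1] + x[2]


x, baseline = [2.0, 3.0, 5.0], [0.0, 0.0, 0.0]
meta_add = meta_attributions(additive, x, baseline)
meta_int = meta_attributions(interacting, x, baseline)
print("additive, max cross-feature effect:", max_off_diagonal(meta_add))
print("interacting, feature 1 -> feature 0:", meta_int[0][1])
print("interacting, feature 2 -> feature 0:", meta_int[0][2])
```

A check in this spirit passes when cross-feature meta-attributions vanish for the additive model and appear only between the interacting pair; it says nothing about directionality, for which a ground truth would have to be defined separately.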

Figures

Figures reproduced from arXiv: 2605.06295 by Fabian Fumagalli, Hubert Baniecki, Przemyslaw Biecek.

**Figure 1.** Complementary interpretations of a simple transformer solving integer addition.

**Figure 2.** METAGAME quantifies gradient-based token interactions in vision-language encoders. Given a token attribution method (Grad-ECLIP) and a dual encoder (Meta CLIP-2), we compute meta-attributions from text token subsets and their corresponding visual patch attributions. First-order attributions quantify the effects that text tokens dog and yellow have on the similarity map (red, most similar). Directional meta…

**Figure 3.** METAGAME quantifies token interactions in instruction-tuned large language models. We compute Meta-AttnLRP as Shapley values from text tokens into AttnLRP token attributions of the Gemma language model’s generated output, highlighting directional second-order effects. We also measure the recall of detecting human-labeled interactions (e.g. word connotations, negation) on a sample of prompts spanning variou…

**Figure 4.** METAGAME quantifies concept interactions in multimodal diffusion transformers. Shapley values average attention across concept subsets, and interpret their directional dependencies.

**Figure 5.** Complementary interpretations of a simple transformer solving integer addition.

**Figure 6.** Complementary interpretations of a simple transformer solving integer addition.

**Figure 7.** Example of evaluating Meta-Grad-ECLIP interactions explaining MetaCLIP-2 on the fish-koala-balloon-laptop pointing game. Panel labels: fish; fish koala; fish koala balloon; fish koala balloon laptop.

**Figure 8.** Reproduces the example of explaining MetaCLIP-2 with Meta-Grad-ECLIP from …

**Figure 9.** Example of synergies and antisynergies between text tokens hot, dog, eating on the attribution of image patches. The second-order effect between text token pair hot-dog and attribution of image patches can serve as a proxy for a tri-token interpretation of the model’s prediction.

**Figure 10.** METAGAME quantifies token interactions in instruction-tuned and pre-trained language models. Supplementary results extending …

**Figure 11.** Examples of quantifying token interactions in instruction-tuned language models.

**Figure 12.** Examples of quantifying token interactions in instruction-tuned language models.

**Figure 13.** Ablations extending …

Original abstract

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
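Written out in symbols, the verbal definition in the abstract could take the following form. This is a sketch under assumptions: the metagame payoff $\nu_i(S)$, the masked input $x_S$, and the baseline $b$ are notation introduced here, following the value function described in the simulated rebuttal on this page, and are not quoted from the paper.

```latex
% Assumed notation: N = \{1, \dots, n\} indexes features; x_S keeps x_k for k \in S
% and substitutes a baseline value b_k for k \notin S.
\begin{align*}
  \nu_i(S) &:= \phi_i(f, x_S), \qquad S \subseteq N, \\
  \varphi_{j \to i}(f) &:= \sum_{S \subseteq N \setminus \{j\}}
      \frac{|S|!\,(n - |S| - 1)!}{n!}\,
      \bigl[\nu_i(S \cup \{j\}) - \nu_i(S)\bigr].
\end{align*}
```

Under this reading there is one cooperative game per explained feature $i$, and $\varphi_{j \to i}(f)$ is the ordinary Shapley value of player $j$ in that game.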

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The metagame defines meta-attributions by running Shapley on the attribution method itself, but the hierarchical decomposition claim rests on an unspecified value function that risks artifacts.


The paper's core move is to define a metagame on top of any attribution method by computing Shapley values over the attributions themselves to get meta-attributions that show directional influence between features. This is presented as a way to quantify second-order effects in explanations. What is new is the specific reduction of the attribution method to a cooperative game and the claim that this yields a hierarchical decomposition of attributions into meta-attributions, plus a directional extension of interaction indices. The applications to language models, vision-language models, and diffusion transformers are also fresh in this context. The work does a decent job of sketching how this could be used in practice across those three areas, which gives readers a sense of where the metagame might add value beyond standard first-order attributions.

The main soft spot is the theoretical claim. The abstract says they prove the hierarchical decomposition, but without a specified value function for the metagame or a clear coalition structure, the meta-attributions could vary with the choice of setup even if the original attribution stays fixed. That makes the decomposition look less robust than claimed. The empirical section is described but lacks any quantitative details or comparisons in the summary, so it's difficult to judge how well it actually works.

This paper is aimed at interpretability researchers who want tools for higher-order analysis in complex models. A reader focused on Shapley extensions or auditing interactions in deployed systems would find the framework worth examining, even with the open questions. I think it deserves a serious referee because the idea is original and the applications are relevant to current models, though the review should push for explicit definitions and proof details.

Referee Report

1 major / 1 minor

Summary. The paper introduces the metagame framework for quantifying second-order interaction effects in model explanations. For any first-order attribution method φ(f), it defines meta-attributions φ_{j→i}(f) by treating the attribution method as a cooperative game and computing its Shapley value to capture the directional influence of feature j on the attribution of feature i. The central theoretical claims are that attributions hierarchically decompose into these meta-attributions and that the meta-attributions constitute directional extensions of existing interaction indices. The paper also reports empirical applications demonstrating insights into token interactions in instruction-tuned language models, cross-modal similarity in vision-language encoders, and text-to-image concepts in multimodal diffusion transformers.

Significance. If the theoretical claims hold with rigorous derivations, the metagame could offer a principled extension of Shapley-based methods to higher-order effects in interpretability, enabling more structured analysis of how features influence attributions themselves. The cross-modal empirical applications illustrate potential breadth, though the absence of quantitative metrics makes the practical significance harder to gauge at present.

major comments (1)

[Abstract] Abstract: The claim that 'attributions hierarchically decompose into meta-attributions' and constitute 'directional extensions of existing interaction indices' is asserted without any derivation steps, key lemmas, explicit definition of the value function v(S), or coalition structure for the metagame. This is load-bearing for the central theoretical contribution, as different choices of v(S) (e.g., marginal vs. average contribution) could produce different φ_{j→i} while leaving the original φ unchanged, introducing formulation-dependent artifacts that would break the claimed decomposition.
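The concern about formulation dependence can be made concrete with a small Python sketch. Note the substitution: the referee's example contrasts marginal with average contributions, whereas this sketch varies a different ingredient of the value function, the masking baseline. It assumes a coalition payoff equal to the gradient×input attribution of feature i on a baseline-masked input; the toy model `f(x) = x1 + x1 * x2**2` and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley(n, value):
    """Exact Shapley values of an n-player game; `value` maps a frozenset of players to a float."""
    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[j] += weight * (value(coalition | {j}) - value(coalition))
    return phi


def grad_x_input(x):
    """Gradient x input for the toy model f(x) = x1 + x1 * x2**2 (analytic gradient)."""
    x1, x2 = x
    grad = (1.0 + x2 ** 2, 2.0 * x1 * x2)
    return [xi * gi for xi, gi in zip(x, grad)]


def meta_attributions(x, baseline):
    """meta[i][j] under an assumed baseline-masking value function."""
    n = len(x)
    meta = []
    for i in range(n):
        def value(coalition, i=i):
            masked = [x[k] if k in coalition else baseline[k] for k in range(n)]
            return grad_x_input(masked)[i]
        meta.append(shapley(n, value))
    return meta


x = [2.0, 3.0]
first_order = grad_x_input(x)  # does not depend on any baseline
meta_zero = meta_attributions(x, [0.0, 0.0])
meta_ones = meta_attributions(x, [1.0, 1.0])

print("first-order attribution:", first_order)
print("meta-attributions, baseline (0, 0):", meta_zero)
print("meta-attributions, baseline (1, 1):", meta_ones)
```

Here the first-order attribution at `x` is identical in both runs, yet the meta-attribution matrices differ, and the row sums differ by the attribution evaluated at the baseline. Whether the paper fixes these choices is exactly what the comment asks the authors to state.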

minor comments (1)

[Empirical sections] Empirical applications: The demonstrations across language models, vision-language encoders, and diffusion transformers are described qualitatively without reported quantitative results, error bars, baseline comparisons, or ablation studies on the metagame parameters.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for highlighting the need for greater clarity on the theoretical foundations in the abstract. We address the major comment below and offer revisions to strengthen the presentation.


Referee: [Abstract] Abstract: The claim that 'attributions hierarchically decompose into meta-attributions' and constitute 'directional extensions of existing interaction indices' is asserted without any derivation steps, key lemmas, explicit definition of the value function v(S), or coalition structure for the metagame. This is load-bearing for the central theoretical contribution, as different choices of v(S) (e.g., marginal vs. average contribution) could produce different φ_{j→i} while leaving the original φ unchanged, introducing formulation-dependent artifacts that would break the claimed decomposition.

Authors: We agree that the abstract, constrained by length, states the central claims at a summary level without derivations or explicit definitions. The full manuscript (Section 3) supplies these details: the metagame is defined as a cooperative game whose players are the input features; the value function v(S) is the attribution φ_i(f) of feature i under the model restricted to coalition S (with out-of-coalition features set to a baseline value); the coalition structure is the standard power set. The meta-attribution φ_{j→i}(f) is the Shapley value of player j in this game. Theorem 3.1 proves the hierarchical decomposition φ_i(f) = ∑_j φ_{j→i}(f) + baseline term. We further show that the construction yields directional extensions of standard interaction indices (e.g., it reduces to the pairwise interaction index of Grabisch et al. when symmetry is imposed). On the choice of v(S), our formulation uses the marginal contribution that is consistent with the original attribution method φ; this guarantees that the sum of meta-attributions recovers φ exactly, so no formulation-dependent artifacts arise. Alternative v(S) definitions (e.g., average rather than marginal) would generally break this recovery property, which is why we adopt the marginal version. We will revise the abstract to include a concise reference to the value function and the decomposition theorem.

revision: yes

Circularity Check

1 step flagged

Hierarchical decomposition of attributions into meta-attributions follows by construction from Shapley efficiency in the metagame definition

specific steps

self-definitional [Abstract]
"For any first-order attribution φ(f) explaining a model f, we measure the directional influence of feature j on the attribution of feature i, denoted as meta-attribution ϕ_{j→i}(f), by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices."

The decomposition is guaranteed by the efficiency property of Shapley values: the original attribution φ_i(f) equals the sum over j of the meta-attributions ϕ_{j→i}(f) by construction of the metagame definition. The 'proof' therefore reduces to restating a standard axiom of the chosen value function rather than deriving a new hierarchical property.

full rationale

The paper defines meta-attributions by applying Shapley values to the attribution method treated as a cooperative game. The claimed proof that attributions 'hierarchically decompose' into these meta-attributions is then a direct restatement of the efficiency axiom (sum of values equals total game value), which holds for any Shapley computation by definition. This makes the central theoretical result equivalent to the input definition rather than an independent derivation. No other circular patterns (self-citations, fitted predictions, or ansatzes) are evident from the provided text.
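The efficiency argument can be spelled out in one line. The notation is an assumption made here for illustration: $\nu_i(S)$ denotes the attribution of feature $i$ when only the features in coalition $S$ are kept and the rest are set to a baseline $b$, and the abstract's LaTeX symbols are used ($\phi$ for the first-order attribution, $\varphi$ for the meta-attribution).

```latex
% Efficiency of the Shapley value applied to the assumed metagame payoff \nu_i:
% the players' values sum to the grand-coalition payoff minus the empty-coalition payoff.
\begin{equation*}
  \sum_{j \in N} \varphi_{j \to i}(f)
    \;=\; \nu_i(N) - \nu_i(\emptyset)
    \;=\; \phi_i(f, x) - \phi_i(f, b).
\end{equation*}
```

Read this way, the decomposition holds for any payoff $\nu_i$, which is why the audit treats it as a consequence of the definition; the remaining term $\phi_i(f, b)$ corresponds to the baseline term mentioned in the simulated rebuttal.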

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on applying cooperative game theory (Shapley values) to attribution methods and asserting a hierarchical decomposition that extends existing interaction indices.

axioms (1)

standard math: Shapley value axioms hold when the attribution method is viewed as a cooperative game
Invoked to define meta-attribution φ_{j→i}(f)

invented entities (2)

metagame (no independent evidence)
purpose: Conceptual framework for second-order interaction effects of model explanations
New term and structure introduced to organize meta-attributions
meta-attribution φ_{j→i}(f) (no independent evidence)
purpose: Directional measure of feature j's influence on attribution of feature i
Core new quantity defined via Shapley value on the attribution method

pith-pipeline@v0.9.0 · 5450 in / 1301 out tokens · 77561 ms · 2026-05-08T12:59:17.297349+00:00 · methodology

