Adalina: Adaptive Linear Approximation for the Shapley Value and Beyond

Bryan Kian Hsiang Low; Weida Li; Yaoliang Yu

arxiv: 2604.08438 · v2 · pith:JPBADMBSnew · submitted 2026-04-09 · 💻 cs.LG

Adalina: Adaptive Linear Approximation for the Shapley Value and Beyond

Weida Li , Yaoliang Yu , Bryan Kian Hsiang Low This is my paper

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords Shapley valuesemi-valuesapproximation algorithmsrandomized algorithmsmean squared erroradaptive samplinglinear space

0 comments

The pith

A linear-space algorithm approximates Shapley values and semi-values with O(n/ε² log(1/δ)) queries while minimizing mean squared error for any utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a theoretical framework, based on a vector concentration inequality, that delivers provable query bounds for approximating the entire family of semi-values under a strict linear-space limit. It shows this bound suffices to guarantee L2 error at most ε with probability 1-δ for every common semi-value, including the Shapley value itself. The same framework lets the sampler adapt its sampling probabilities to the specific utility function at hand, thereby reducing mean squared error below what fixed-sampling methods achieve. This matters because exact computation of these attribution scores requires an exponential number of utility evaluations, rendering them unusable for large player sets that arise in machine learning explainability and cooperative-game applications.

Core claim

By applying a vector concentration inequality to unbiased randomized estimators of semi-values, the authors derive an O(n/ε² log(1/δ)) query complexity that simultaneously guarantees the stated high-probability L2 bound for all common semi-values and permits explicit minimization of mean squared error for any given utility. The resulting adaptive algorithm, Adalina, runs in linear time and linear space and unifies several prior unbiased estimators including OFA, kernelSHAP, and SHAP-IQ while characterizing when paired sampling helps.

What carries the argument

A vector concentration inequality applied to linear-space unbiased estimators of semi-values; it supplies both the query-complexity bound and the mechanism for choosing sampling weights that minimize mean squared error for a concrete utility.

If this is right

All commonly used semi-values, including Shapley, Banzhaf, and others, receive the same query bound.
Paired sampling improves the bound only for certain utility functions, which the framework identifies explicitly.
Mean squared error can be minimized directly for any fixed utility without changing the linear space or time guarantees.
Existing unbiased methods (OFA, kernelSHAP, SHAP-IQ, regression-adjusted) are recovered as special cases inside the same analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same concentration-based analysis could be applied to other solution concepts in cooperative games that admit linear unbiased estimators.
In large-scale feature attribution, the adaptive MSE minimization may produce noticeably lower error than non-adaptive baselines once n exceeds a few hundred players.
A deterministic derandomization of the adaptive weights would remove the remaining logarithmic factor in the failure probability.

Load-bearing premise

The vector concentration inequality must apply with the claimed tightness to the particular semi-value estimators under the linear-space constraint.

What would settle it

Run the algorithm on a sequence of increasing n and decreasing ε and check whether the observed failure rate or mean squared error deviates from the predicted O(n/ε² log(1/δ)) scaling on standard benchmark utility functions.

Figures

Figures reproduced from arXiv: 2604.08438 by Bryan Kian Hsiang Low, Weida Li, Yaoliang Yu.

**Figure 1.** Figure 1: The relative approximation error of unbiased kernelSHAP in approximating the Shapley value with different sampling distributions. Here, λ = u[n]−u∅ n . The dashed lines correspond to those with paired sampling, whereas the solid lines are without paired sampling. For the first row, the utility function there is positive, whereas for the second row, U(S) can be either positive or negative. Paired sampling. … view at source ↗

**Figure 2.** Figure 2: The relative approximation error of ϕˆ γ in approximating the Shapley value as γ varies. This suggests that the modified unbiased KernelSHAP is superior in terms of query complexity. Remark. We notice that Chen et al. (2025, Proposition E.1) constructed a specific utility function to show that the modified unbiased kernelSHAP could behave worse by a multiplicative factor of √ n compared to the variant usin… view at source ↗

**Figure 3.** Figure 3: The relative approximation error of SHAP-IQ, ϕˆ γ and its adaptive variant. The adaptive method corresponds to Adalina, presented in Algorithm 1, which adaptively estimates the optimal γ. Weighted Banzhaf values are parameterized by w ∈ (0, 1), whereas Beta Shapley values are parameterized by (α, β), with (1, 1) corresponding to the Shapley value. then by Theorem 3.1, E[∥ϕˆ γ − ϕ∥ 2 2 ] = nD∗E[(uS − γ) 2 ]… view at source ↗

**Figure 4.** Figure 4: The relative approximation error of different randomized algorithms. Weighted Banzhaf values are parameterized by w ∈ (0, 1), whereas Beta Shapley values are parameterized by (α, β), with (1, 1) corresponding to the Shapley value. Utility functions. Each utility function is defined using a trained (gradient boosting) decision tree f and an instance x ∈ R n. Specifically, U x f (S) := EX[n]\S [f(xS, X[n]\S)… view at source ↗

**Figure 5.** Figure 5: The relative approximation error of different randomized algorithms. Weighted Banzhaf values are parameterized by w ∈ (0, 1), whereas Beta Shapley values are parameterized by (α, β), with (1, 1) corresponding to the Shapley value. C. Experiments For kernelSHAP, we set q = q ∗ and λ = u[n]−u∅ n , a combination that is empirically the best according to Chen et al. (2025), as confirmed in [PITH_FULL_IMAGE:fi… view at source ↗

read the original abstract

The Shapley value, and its broader family of semi-values, has received much attention in various attribution problems. A fundamental and long-standing challenge is their efficient approximation, since exact computation generally requires an exponential number of utility queries in the number of players $n$. To meet the challenges of large-scale applications, we explore the limits of efficiently approximating semi-values under a $\Theta(n)$ space constraint. Building upon a vector concentration inequality, we establish a theoretical framework that enables sharper query complexities for existing unbiased randomized algorithms. Within this framework, we systematically develop a linear-space algorithm that requires $O(\frac{n}{\epsilon^{2}}\log\frac{1}{\delta})$ utility queries to ensure $P(\|\hat{\boldsymbol\phi}-\boldsymbol\phi\|_{2}\geq\epsilon)\leq \delta$ for all commonly used semi-values. In particular, our framework naturally bridges OFA, unbiased kernelSHAP, SHAP-IQ and the regression-adjusted approach, and definitively characterizes when paired sampling is beneficial. Moreover, our algorithm allows explicit minimization of the mean squared error $\mathbb{E}[\|\hat{\boldsymbol\phi}-\boldsymbol\phi\|_{2}^{2}]$ for each specific utility function. Accordingly, we introduce the first adaptive, linear-time, linear-space randomized algorithm, Adalina, that theoretically achieves improved mean squared error. All of our theoretical findings are experimentally validated. Our code is available at https://github.com/watml/adalina.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Adalina adapts sampling to minimize MSE for semi-value approximation under linear space and gives a unifying framework, but the vector concentration inequality needs verification for the adaptive dependent estimators.

read the letter

The main thing to know is that this paper introduces Adalina, an adaptive algorithm that chooses sampling distributions to reduce mean squared error for a given utility while staying linear in space and hitting the O(n/ε² log(1/δ)) query bound for the L2 error probability guarantee across common semi-values. It also maps out when paired sampling helps and ties several prior methods together under one analysis.

Referee Report

2 major / 2 minor

Summary. The paper develops a theoretical framework based on vector concentration inequalities to derive sharper query complexities for unbiased randomized algorithms approximating semi-values (including Shapley) under a Θ(n) space constraint. It claims an O(n/ε² log(1/δ)) utility query bound guaranteeing P(‖φ̂ - φ‖₂ ≥ ε) ≤ δ uniformly for common semi-values, introduces the adaptive linear-time linear-space algorithm Adalina that explicitly minimizes MSE for specific utility functions, bridges OFA/kernelSHAP/SHAP-IQ/regression-adjusted methods, characterizes when paired sampling helps, and validates all claims experimentally.

Significance. If the central guarantees hold, the work would be significant for large-scale attribution by delivering the first adaptive linear-time linear-space randomized algorithm with explicit per-utility MSE minimization and uniform high-probability bounds, while providing a unifying framework that explains existing methods and improves upon non-adaptive baselines.

major comments (2)

[§3] §3 (theoretical framework and query-complexity derivation): the O(n/ε² log(1/δ)) bound is obtained by invoking a vector concentration inequality on the semi-value estimators; however, these estimators are linear combinations of utility evaluations whose sampling distributions depend on the semi-value and whose variances depend on the unknown utility function. The manuscript does not explicitly verify that the adaptive sampling scheme satisfies the coordinate-wise sub-Gaussianity or moment conditions required by the cited inequality, which is load-bearing for both the stated complexity and the uniformity claim across semi-values.
[Theorem on query complexity] Theorem establishing the uniform bound (around the vector-concentration application): the claim that the same O(n/ε² log(1/δ)) query complexity holds for all commonly used semi-values follows only if the inequality transfers directly to each estimator family; without a per-semi-value check of the required tail conditions under the Θ(n)-space adaptive scheme, the uniformity does not necessarily follow from the cited inequality alone.

minor comments (2)

[Abstract] Abstract: the statement that 'all of our theoretical findings are experimentally validated' is not accompanied by any description of the experimental protocol (range of n, choice of ε/δ, utility functions, or baselines), making it impossible to assess whether the experiments actually probe the adaptive MSE minimization or the concentration bound.
[Notation] Notation section: the definitions of the sampling probabilities for each semi-value and the precise form of the linear estimator could be stated more explicitly to allow readers to verify the variance expressions used in the MSE minimization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. The comments on the theoretical framework are well-taken and point to places where additional explicit verification will strengthen the presentation. We address each major comment below and will incorporate the clarifications in the revised version.

read point-by-point responses

Referee: [§3] §3 (theoretical framework and query-complexity derivation): the O(n/ε² log(1/δ)) bound is obtained by invoking a vector concentration inequality on the semi-value estimators; however, these estimators are linear combinations of utility evaluations whose sampling distributions depend on the semi-value and whose variances depend on the unknown utility function. The manuscript does not explicitly verify that the adaptive sampling scheme satisfies the coordinate-wise sub-Gaussianity or moment conditions required by the cited inequality, which is load-bearing for both the stated complexity and the uniformity claim across semi-values.

Authors: We thank the referee for identifying this important point. The original manuscript invokes the vector concentration inequality on the unbiased linear estimators after establishing their sampling distributions for each semi-value, but does not include an explicit per-semi-value verification that the adaptive Θ(n)-space scheme preserves the required coordinate-wise sub-Gaussianity or moment bounds. While the adaptation is variance-reducing and the per-coordinate sampling probabilities are fixed by the semi-value definition (independent of the unknown utility), we agree this transfer was not spelled out. In the revision we will add a dedicated subsection (or appendix) that verifies the sub-Gaussian parameters for each common semi-value under the adaptive scheme, confirming that the same O(n/ε² log(1/δ)) bound holds. revision: yes
Referee: [Theorem on query complexity] Theorem establishing the uniform bound (around the vector-concentration application): the claim that the same O(n/ε² log(1/δ)) query complexity holds for all commonly used semi-values follows only if the inequality transfers directly to each estimator family; without a per-semi-value check of the required tail conditions under the Θ(n)-space adaptive scheme, the uniformity does not necessarily follow from the cited inequality alone.

Authors: We agree that uniformity across semi-values requires confirming that the tail conditions transfer to each estimator family under the adaptive scheme. The framework derives the bound from the general structure of unbiased linear combinations with semi-value-specific sampling distributions and bounded utility assumptions; the adaptive component minimizes empirical MSE without altering the fixed sampling probabilities per coordinate. To make the uniformity rigorous, the revision will include explicit per-semi-value checks (Shapley, Banzhaf, etc.) showing that the moment conditions are satisfied with constants independent of the utility function, thereby justifying the uniform O(n/ε² log(1/δ)) claim. revision: yes

Circularity Check

0 steps flagged

No circularity: bounds derived from external inequality and explicit optimization

full rationale

The paper constructs its query complexity O(n/ε² log(1/δ)) and Adalina algorithm by invoking an external vector concentration inequality, then applying it within a new framework to existing unbiased estimators (OFA, kernelSHAP, etc.) and performing explicit MSE minimization over sampling distributions for each utility function. No step equates a claimed prediction or bound to a fitted parameter, self-defined quantity, or self-citation chain by the paper's own equations. The derivation remains independent of the target result and does not rename known patterns or smuggle ansatzes via author citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides limited detail; the central claims rest on an external vector concentration inequality and the linear-space constraint. No free parameters, invented entities, or additional axioms are described.

axioms (1)

domain assumption A vector concentration inequality applies to the unbiased estimators of the semi-values.
Invoked to obtain the O(n/ε² log(1/δ)) query bound.

pith-pipeline@v0.9.0 · 5530 in / 1296 out tokens · 32966 ms · 2026-05-10T18:33:07.174134+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Building upon a vector concentration inequality, we establish a theoretical framework that enables sharper query complexities... O(n/ε² log 1/δ) utility queries... Adalina... minimizes the mean square error
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 2.1 (Vector concentration inequality)... P(‖1/M Σ Xi‖₂ ≥ ε) ≤ 2 exp(−M ε²/(4σ²))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.