pith. sign in

arxiv: 2605.21033 · v1 · pith:S4ELLAMKnew · submitted 2026-05-20 · 💻 cs.LG · cs.DS

Efficient Banzhaf-Based Data Valuation for k-Nearest Neighbors Classification

Pith reviewed 2026-05-21 05:43 UTC · model grok-4.3

classification 💻 cs.LG cs.DS
keywords data valuationBanzhaf valuek-nearest neighborsdynamic programmingcomputational complexitygame-theoretic valuationmachine learning algorithms
0
0 comments X

The pith

Banzhaf values for k-nearest neighbor classifiers can be computed exactly in linear time using dynamic programming.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes efficient exact algorithms for computing Banzhaf values in kNN classification by exploiting the model's locality property. This addresses the exponential complexity of game-theoretic data valuation, which is proven #P-hard in general. A sympathetic reader would care because it makes fair assessment of individual data points' contributions feasible for a popular classifier without relying on approximations. The methods include a pseudo-polynomial DP for weighted cases and an O(n k squared) algorithm for unweighted ones, supported by Monte Carlo alternatives and experiments on real data.

Core claim

We develop a dynamic programming framework that computes exact Banzhaf values for kNN classifiers by leveraging that only the top-k neighbors determine each prediction. This yields a pseudo-polynomial time algorithm of O(W k n squared) for weighted kNN where W is the max sum of top-k weights, and O(n k squared) for unweighted kNN, which is linear in the number of points. We also prove the problem is #P-hard and provide Monte Carlo estimation methods.

What carries the argument

A dynamic programming framework that encodes the top-k locality of kNN predictions to compute exact Banzhaf contributions without full coalition enumeration.

If this is right

  • Exact Banzhaf values become computable for large datasets using kNN.
  • The unweighted algorithm runs in time linear in the number of data points.
  • Monte Carlo methods offer efficient approximations with guarantees.
  • Data valuation can be applied practically to identify important training points for kNN models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar locality-based DP might apply to other distance-based or local models in ML.
  • These valuations could support data selection or cleaning pipelines for improved model performance.
  • Testing on synthetic data where full enumeration is possible would validate the exactness.

Load-bearing premise

The prediction in kNN depends only on the top-k neighbors in a manner that allows tracking their contributions through dynamic programming without missing interactions or adding error.

What would settle it

Compare the Banzhaf values computed by the DP algorithm against brute-force summation over all 2^n subsets on a small dataset with n around 20 to check for exact match.

Figures

Figures reproduced from arXiv: 2605.21033 by Aristides Gionis, Guangyi Zhang, Lixu Wang, Lutz Oettershagen.

Figure 1
Figure 1. Figure 1: The values of all data points in a 2D dataset (Fig. 1a) are [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 1
Figure 1. Figure 1: The difference between the Shapley and Banzhaf values. The model performance is measured by the accuracy of a [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Computational efficiency of the proposed algorithms. [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Point removal results on different datasets: model accuracy as the data points with the largest values are removed [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
read the original abstract

Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proves that exact Banzhaf-value computation for kNN data valuation is #P-hard and then presents dynamic-programming algorithms that exploit the locality of kNN predictions (only the top-k neighbors matter) to obtain a pseudo-polynomial O(W k n²) algorithm for weighted kNN and an O(n k²) algorithm for the unweighted case, together with Monte-Carlo estimators; experiments on real data are reported to demonstrate practical speed-ups.

Significance. If the stated running times are correct and the algorithms are free of off-by-one or boundary errors, the work would supply the first exact, non-trivial polynomial-time methods for a concrete, widely used model family, moving data valuation from exponential enumeration or crude sampling toward practical exact computation for kNN.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (presumably the complexity claims): the stated O(n k²) bound for the unweighted case is presented as linear in the number of training points n. However, because each of the m test points induces a distinct distance ordering, the DP must be run separately for every test point; the total time is therefore Θ(m n k²) in the worst case. The manuscript neither states an assumption that m = O(1) nor supplies a shared-DP argument that would remove the m factor. This directly affects the central claim of practicality.
  2. [§2] §2 (hardness proof): the #P-hardness argument is asserted but no reduction, no explicit construction of the utility function, and no verification that the reduction preserves the kNN locality structure are supplied. Without these steps it is impossible to confirm that the hardness result applies to the exact setting the DP algorithms later solve.
minor comments (2)
  1. [§3] The recurrence relations and base cases for the DP are described at a high level; pseudocode or an explicit state-transition table would make verification of boundary conditions straightforward.
  2. [§1 / §3] Notation for the test-set size m and its appearance (or absence) in the complexity expressions should be introduced consistently from the problem definition onward.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We respond to each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (presumably the complexity claims): the stated O(n k²) bound for the unweighted case is presented as linear in the number of training points n. However, because each of the m test points induces a distinct distance ordering, the DP must be run separately for every test point; the total time is therefore Θ(m n k²) in the worst case. The manuscript neither states an assumption that m = O(1) nor supplies a shared-DP argument that would remove the m factor. This directly affects the central claim of practicality.

    Authors: We agree that the O(n k²) bound applies to a single test point. The manuscript focuses on linearity in the training-set size n because that is the dominant term in typical data-valuation settings where one evaluates contributions to a fixed prediction. Nevertheless, when m test points have distinct orderings the total cost is O(m n k²). We will revise the abstract and §3 to state the per-test-point complexity explicitly, to note the linear dependence on m, and to discuss whether a shared-DP table across test points can be maintained (preliminary checks suggest only limited sharing is possible). revision: yes

  2. Referee: [§2] §2 (hardness proof): the #P-hardness argument is asserted but no reduction, no explicit construction of the utility function, and no verification that the reduction preserves the kNN locality structure are supplied. Without these steps it is impossible to confirm that the hardness result applies to the exact setting the DP algorithms later solve.

    Authors: We apologize for the insufficient detail. The proof reduces from the #P-complete problem of counting subsets whose weighted sum lies in a given range; the utility function is defined so that the Banzhaf value encodes the number of such subsets that cause a particular label to be among the k nearest neighbors. We will expand §2 with the full reduction, the explicit utility-function construction, and a short argument showing that the reduction respects kNN locality (only the top-k neighbors affect the utility). This will confirm that the hardness applies precisely to the problem solved by the subsequent DP algorithms. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation; complexity claims rest on explicit DP construction

full rationale

The paper first proves #P-hardness via reduction, then constructs a dynamic program whose states track position in distance order, number of selected points up to k, and vote tally. Transitions follow directly from the definition of kNN (only top-k neighbors contribute) and the Banzhaf summation formula. No parameter is fitted and then renamed as a prediction, no self-citation supplies a uniqueness theorem, and the stated O(n k²) bound is obtained by counting states and transitions per test point without circular reduction to the input utility function. The derivation is self-contained against the standard definition of Banzhaf values and kNN locality.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard result that Banzhaf value computation is #P-hard in general and on the modeling assumption that kNN predictions depend only on a fixed number of nearest neighbors; no free parameters or new invented entities are introduced.

axioms (2)
  • standard math Banzhaf value computation is #P-hard for general cooperative games.
    Invoked to establish intractability before presenting the kNN-specific algorithms.
  • domain assumption kNN classifier predictions are determined solely by the k nearest training points under the chosen distance.
    This locality is the key property exploited by the dynamic program.

pith-pipeline@v0.9.0 · 5759 in / 1405 out tokens · 23789 ms · 2026-05-21T05:43:20.359916+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

  1. [1]

    Yoram Bachrach, Evangelos Markakis, Ezra Resnick, Ariel D Procaccia, Jeffrey S Rosenschein, and Amin Saberi. 2010. Approximating power indices: theoretical and empirical analysis.Autonomous Agents and Multi-Agent Systems20, 2 (2010), 105–122

  2. [2]

    John F Banzhaf III. 1965. Weighted voting doesn’t work: A mathematical analysis. Rutgers Law Review19 (1965), 317–343

  3. [3]

    Olivier Bousquet and André Elisseeff. 2002. Stability and generalization.Journal of machine learning research2, Mar (2002), 499–526

  4. [4]

    Javier Castro, Daniel Gómez, and Juan Tejada. 2009. Polynomial calculation of the Shapley value based on sampling.Comput. Oper. Res.36, 5 (2009), 1726–1730

  5. [5]

    Dennis Cook and Sanford Weisberg

    R. Dennis Cook and Sanford Weisberg. 1982.Residuals and Influence in Regression. Chapman and Hall, New York, NY, USA

  6. [6]

    Papadimitriou

    Xiaotie Deng and Christos H. Papadimitriou. 1994. On the Complexity of Coop- erative Solution Concepts.Math. Oper. Res.19, 2 (1994), 257–266. 12

  7. [7]

    Edith Elkind and Jörg Rothe. 2016. Cooperative game theory.Economics and computation: an introduction to algorithmic game theory, computational social choice, and fair division(2016), 135–193

  8. [8]

    Shaheen Fatima, Michael J

    S. Shaheen Fatima, Michael J. Wooldridge, and Nicholas R. Jennings. 2008. A linear approximation method for the Shapley value.Artif. Intell.172, 14 (2008), 1673–1699

  9. [9]

    Vitaly Feldman and Chiyuan Zhang. 2020. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation.Advances in Neural Information Processing Systems33 (2020), 2881–2891

  10. [10]

    Amirata Ghorbani and James Y. Zou. 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. InProceedings of the 36th International Conference on Machine Learning. PMLR, 2242–2251

  11. [11]

    2002.A distribution-free theory of nonparametric regression

    László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002.A distribution-free theory of nonparametric regression. Springer

  12. [12]

    Zayd Hammoudeh and Daniel Lowd. 2024. Training data influence analysis and estimation: a survey.Mach. Learn.113, 5 (2024), 2351–2403

  13. [13]

    Moritz Hardt, Ben Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of stochastic gradient descent. InInternational conference on machine learning. PMLR, 1225–1234

  14. [14]

    Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Alek- sander Madry. 2022. Datamodels: Understanding Predictions with Data and Data with Predictions. InProceedings of the 39th International Conference on Machine Learning. PMLR, 9525–9587

  15. [15]

    Spanos, and Dawn Song

    Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gürel, Bo Li, Ce Zhang, Costas J. Spanos, and Dawn Song. 2019. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms.Proc. VLDB Endow.12, 11 (2019), 1610–1623

  16. [16]

    Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. 2019. Towards efficient data valuation based on the shapley value. InThe 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 1167–1176

  17. [17]

    Zou, and Yongchan Kwon

    Kevin Fu Jiang, Weixin Liang, James Y. Zou, and Yongchan Kwon. 2023. Open- DataVal: a Unified Benchmark for Data Valuation.Advances in Neural Information Processing Systems36 (2023), 28624–28647

  18. [18]

    Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, and Ruoxi Jia. 2023. LAVA: Data Valuation without Pre-Specified Learning Algorithms. InThe Eleventh International Conference on Learning Representations. OpenReview

  19. [19]

    Adam Karczmarz, Tomasz Michalak, Anish Mukherjee, Piotr Sankowski, and Piotr Wygocki. 2022. Improved feature importance computation for tree models based on the Banzhaf value. InUncertainty in Artificial Intelligence. PMLR, 969– 979

  20. [20]

    Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. 2023. The UCI Machine Learning Repository. https://archive.ics.uci.edu/

  21. [21]

    Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. InProceedings of the 34th International Conference on Machine Learning. PMLR, 1885–1894

  22. [22]

    Yongchan Kwon and James Zou. 2022. Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning. InProceedings of The 25th International Conference on Artificial Intelligence and Statistics. PMLR, 8780–8802

  23. [23]

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8424–8445

  24. [24]

    Weida Li and Yaoliang Yu. 2023. Robust data valuation with weighted banzhaf values.Advances in Neural Information Processing Systems36 (2023), 60349– 60383

  25. [25]

    Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. 2020. From local explanations to global understanding with explainable AI for trees. Nature machine intelligence2, 1 (2020), 56–67

  26. [26]

    Sasan Maleki, Long Tran-Thanh, Greg Hines, Talal Rahwan, and Alex Rogers

  27. [27]

    Bounding the Estimation Error of Sampling-based Shapley Value Approximation

    Bounding the Estimation Error of Sampling-based Shapley Value Approxi- mation. arXiv:1306.4265 [cs.GT]

  28. [28]

    Tomomi Matsui and Yasuko Matsui. 2000. A survey of algorithms for calculating power indices of weighted majority games.Journal of the Operations Research Society of Japan43, 1 (2000), 71–86

  29. [29]

    Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. 2022. Sampling permutations for shapley value estimation.Journal of Machine Learning Research 23, 43 (2022), 1–46

  30. [30]

    Kislaya Prasad and Jerry S Kelly. 1990. NP-completeness of some problems concerning voting games.International Journal of Game Theory19 (1990), 1–9

  31. [31]

    Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. 2020. Estimating Training Data Influence by Tracing Gradient Descent.Advances in Neural Information Processing Systems33 (2020), 19920–19930

  32. [32]

    Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. 2022. Interpretable machine learning: Fundamental principles and 10 grand challenges.Statistic Surveys16 (2022), 1–85

  33. [33]

    Andrea Schioppa, Polina Zablotskaia, David Vilar, and Artem Sokolov. 2022. Scaling Up Influence Functions.Proceedings of the AAAI Conference on Artificial Intelligence36, 8 (2022), 8179–8186

  34. [34]

    Lloyd S. Shapley. 1953. A Value for n-Person Games. InContributions to the Theory of Games, Volume II. Princeton University Press, Princeton, NJ, USA, 307–318

  35. [35]

    Ingredients

    Rachael Hwee Ling Sim, Xinyi Xu, and Bryan Kian Hsiang Low. 2022. Data Valuation in Machine Learning: “Ingredients”, Strategies, and Open Challenges. InProceedings of the Thirty-First International Joint Conference on Artificial In- telligence, IJCAI-22. International Joint Conferences on Artificial Intelligence Organization, 5607–5614

  36. [36]

    Takeaki Uno. 2012. Efficient computation of power indices for weighted majority games. InAlgorithms and Computation: 23rd International Symposium, ISAAC 2012, Taipei, Taiwan, December 19-21, 2012. Proceedings 23. Springer, 679–689

  37. [37]

    Rene Van den Brink and Gerard Van der Laan. 1998. Axiomatizations of the normalized Banzhaf value and the Shapley value.Social Choice and Welfare15, 4 (1998), 567–582

  38. [38]

    van Rijn, Bernd Bischl, and Luis Torgo

    Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: networked science in machine learning.SIGKDD Explorations15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198

  39. [39]

    Diego Varela and Javie Prado-Dominguez. 2012. Negotiating the Lisbon Treaty: Redistribution, Efficiency and Power Indices.AUCO Czech Economic Review6, 2 (2012)

  40. [40]

    Wang and Ruoxi Jia

    Jiachen T. Wang and Ruoxi Jia. 2023. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. InProceedings of The 26th International Con- ference on Artificial Intelligence and Statistics. PMLR, 6388–6421

  41. [41]

    Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

    Jiachen T. Wang and Ruoxi Jia. 2023. A Note on “Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms”. arXiv:2304.04258 [stat.ML]

  42. [42]

    Wang, Prateek Mittal, and Ruoxi Jia

    Jiachen T. Wang, Prateek Mittal, and Ruoxi Jia. 2024. Efficient Data Shapley for Weighted Nearest Neighbor Algorithms. InProceedings of The 27th International Conference on Artificial Intelligence and Statistics. PMLR, 2557–2565

  43. [43]

    Jiachen T Wang, Prateek Mittal, Dawn Song, and Ruoxi Jia. 2025. Data Shapley in One Training Run. InThe Thirteenth International Conference on Learning Representations. OpenReview.net

  44. [44]

    Tom Yan and Ariel D Procaccia. 2021. If You Like Shapley Then You’ll Love the Core.Proceedings of the AAAI Conference on Artificial Intelligence35, 6 (2021), 5751–5759

  45. [45]

    Jiayao Zhang, Qiheng Sun, Jinfei Liu, Li Xiong, Jian Pei, and Kui Ren. 2023. Efficient sampling approaches to shapley value approximation.Proceedings of the ACM on Management of Data1, 1 (2023), 1–24. 13 Algorithm A1:Specialized dynamic programming algorithm for unweighted𝑘NN Input:Dataset𝐷, test point𝑧 test, integer𝑘 1Sort𝑧 1, . . . , 𝑧𝑛 by non-decreasin...