Efficient Banzhaf-Based Data Valuation for k-Nearest Neighbors Classification
Pith reviewed 2026-05-21 05:43 UTC · model grok-4.3
The pith
Banzhaf values for k-nearest neighbor classifiers can be computed exactly in linear time using dynamic programming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a dynamic programming framework that computes exact Banzhaf values for kNN classifiers by leveraging that only the top-k neighbors determine each prediction. This yields a pseudo-polynomial time algorithm of O(W k n squared) for weighted kNN where W is the max sum of top-k weights, and O(n k squared) for unweighted kNN, which is linear in the number of points. We also prove the problem is #P-hard and provide Monte Carlo estimation methods.
What carries the argument
A dynamic programming framework that encodes the top-k locality of kNN predictions to compute exact Banzhaf contributions without full coalition enumeration.
If this is right
- Exact Banzhaf values become computable for large datasets using kNN.
- The unweighted algorithm runs in time linear in the number of data points.
- Monte Carlo methods offer efficient approximations with guarantees.
- Data valuation can be applied practically to identify important training points for kNN models.
Where Pith is reading between the lines
- Similar locality-based DP might apply to other distance-based or local models in ML.
- These valuations could support data selection or cleaning pipelines for improved model performance.
- Testing on synthetic data where full enumeration is possible would validate the exactness.
Load-bearing premise
The prediction in kNN depends only on the top-k neighbors in a manner that allows tracking their contributions through dynamic programming without missing interactions or adding error.
What would settle it
Compare the Banzhaf values computed by the DP algorithm against brute-force summation over all 2^n subsets on a small dataset with n around 20 to check for exact match.
Figures
read the original abstract
Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that exact Banzhaf-value computation for kNN data valuation is #P-hard and then presents dynamic-programming algorithms that exploit the locality of kNN predictions (only the top-k neighbors matter) to obtain a pseudo-polynomial O(W k n²) algorithm for weighted kNN and an O(n k²) algorithm for the unweighted case, together with Monte-Carlo estimators; experiments on real data are reported to demonstrate practical speed-ups.
Significance. If the stated running times are correct and the algorithms are free of off-by-one or boundary errors, the work would supply the first exact, non-trivial polynomial-time methods for a concrete, widely used model family, moving data valuation from exponential enumeration or crude sampling toward practical exact computation for kNN.
major comments (2)
- [Abstract / §3] Abstract and §3 (presumably the complexity claims): the stated O(n k²) bound for the unweighted case is presented as linear in the number of training points n. However, because each of the m test points induces a distinct distance ordering, the DP must be run separately for every test point; the total time is therefore Θ(m n k²) in the worst case. The manuscript neither states an assumption that m = O(1) nor supplies a shared-DP argument that would remove the m factor. This directly affects the central claim of practicality.
- [§2] §2 (hardness proof): the #P-hardness argument is asserted but no reduction, no explicit construction of the utility function, and no verification that the reduction preserves the kNN locality structure are supplied. Without these steps it is impossible to confirm that the hardness result applies to the exact setting the DP algorithms later solve.
minor comments (2)
- [§3] The recurrence relations and base cases for the DP are described at a high level; pseudocode or an explicit state-transition table would make verification of boundary conditions straightforward.
- [§1 / §3] Notation for the test-set size m and its appearance (or absence) in the complexity expressions should be introduced consistently from the problem definition onward.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. We respond to each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract / §3] Abstract and §3 (presumably the complexity claims): the stated O(n k²) bound for the unweighted case is presented as linear in the number of training points n. However, because each of the m test points induces a distinct distance ordering, the DP must be run separately for every test point; the total time is therefore Θ(m n k²) in the worst case. The manuscript neither states an assumption that m = O(1) nor supplies a shared-DP argument that would remove the m factor. This directly affects the central claim of practicality.
Authors: We agree that the O(n k²) bound applies to a single test point. The manuscript focuses on linearity in the training-set size n because that is the dominant term in typical data-valuation settings where one evaluates contributions to a fixed prediction. Nevertheless, when m test points have distinct orderings the total cost is O(m n k²). We will revise the abstract and §3 to state the per-test-point complexity explicitly, to note the linear dependence on m, and to discuss whether a shared-DP table across test points can be maintained (preliminary checks suggest only limited sharing is possible). revision: yes
-
Referee: [§2] §2 (hardness proof): the #P-hardness argument is asserted but no reduction, no explicit construction of the utility function, and no verification that the reduction preserves the kNN locality structure are supplied. Without these steps it is impossible to confirm that the hardness result applies to the exact setting the DP algorithms later solve.
Authors: We apologize for the insufficient detail. The proof reduces from the #P-complete problem of counting subsets whose weighted sum lies in a given range; the utility function is defined so that the Banzhaf value encodes the number of such subsets that cause a particular label to be among the k nearest neighbors. We will expand §2 with the full reduction, the explicit utility-function construction, and a short argument showing that the reduction respects kNN locality (only the top-k neighbors affect the utility). This will confirm that the hardness applies precisely to the problem solved by the subsequent DP algorithms. revision: yes
Circularity Check
No circularity in derivation; complexity claims rest on explicit DP construction
full rationale
The paper first proves #P-hardness via reduction, then constructs a dynamic program whose states track position in distance order, number of selected points up to k, and vote tally. Transitions follow directly from the definition of kNN (only top-k neighbors contribute) and the Banzhaf summation formula. No parameter is fitted and then renamed as a prediction, no self-citation supplies a uniqueness theorem, and the stated O(n k²) bound is obtained by counting states and transitions per test point without circular reduction to the input utility function. The derivation is self-contained against the standard definition of Banzhaf values and kNN locality.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Banzhaf value computation is #P-hard for general cooperative games.
- domain assumption kNN classifier predictions are determined solely by the k nearest training points under the chosen distance.
Reference graph
Works this paper leans on
-
[1]
Yoram Bachrach, Evangelos Markakis, Ezra Resnick, Ariel D Procaccia, Jeffrey S Rosenschein, and Amin Saberi. 2010. Approximating power indices: theoretical and empirical analysis.Autonomous Agents and Multi-Agent Systems20, 2 (2010), 105–122
work page 2010
-
[2]
John F Banzhaf III. 1965. Weighted voting doesn’t work: A mathematical analysis. Rutgers Law Review19 (1965), 317–343
work page 1965
-
[3]
Olivier Bousquet and André Elisseeff. 2002. Stability and generalization.Journal of machine learning research2, Mar (2002), 499–526
work page 2002
-
[4]
Javier Castro, Daniel Gómez, and Juan Tejada. 2009. Polynomial calculation of the Shapley value based on sampling.Comput. Oper. Res.36, 5 (2009), 1726–1730
work page 2009
-
[5]
Dennis Cook and Sanford Weisberg
R. Dennis Cook and Sanford Weisberg. 1982.Residuals and Influence in Regression. Chapman and Hall, New York, NY, USA
work page 1982
-
[6]
Xiaotie Deng and Christos H. Papadimitriou. 1994. On the Complexity of Coop- erative Solution Concepts.Math. Oper. Res.19, 2 (1994), 257–266. 12
work page 1994
-
[7]
Edith Elkind and Jörg Rothe. 2016. Cooperative game theory.Economics and computation: an introduction to algorithmic game theory, computational social choice, and fair division(2016), 135–193
work page 2016
-
[8]
S. Shaheen Fatima, Michael J. Wooldridge, and Nicholas R. Jennings. 2008. A linear approximation method for the Shapley value.Artif. Intell.172, 14 (2008), 1673–1699
work page 2008
-
[9]
Vitaly Feldman and Chiyuan Zhang. 2020. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation.Advances in Neural Information Processing Systems33 (2020), 2881–2891
work page 2020
-
[10]
Amirata Ghorbani and James Y. Zou. 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. InProceedings of the 36th International Conference on Machine Learning. PMLR, 2242–2251
work page 2019
-
[11]
2002.A distribution-free theory of nonparametric regression
László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002.A distribution-free theory of nonparametric regression. Springer
work page 2002
-
[12]
Zayd Hammoudeh and Daniel Lowd. 2024. Training data influence analysis and estimation: a survey.Mach. Learn.113, 5 (2024), 2351–2403
work page 2024
-
[13]
Moritz Hardt, Ben Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of stochastic gradient descent. InInternational conference on machine learning. PMLR, 1225–1234
work page 2016
-
[14]
Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Alek- sander Madry. 2022. Datamodels: Understanding Predictions with Data and Data with Predictions. InProceedings of the 39th International Conference on Machine Learning. PMLR, 9525–9587
work page 2022
-
[15]
Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gürel, Bo Li, Ce Zhang, Costas J. Spanos, and Dawn Song. 2019. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms.Proc. VLDB Endow.12, 11 (2019), 1610–1623
work page 2019
-
[16]
Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. 2019. Towards efficient data valuation based on the shapley value. InThe 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 1167–1176
work page 2019
-
[17]
Kevin Fu Jiang, Weixin Liang, James Y. Zou, and Yongchan Kwon. 2023. Open- DataVal: a Unified Benchmark for Data Valuation.Advances in Neural Information Processing Systems36 (2023), 28624–28647
work page 2023
-
[18]
Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, and Ruoxi Jia. 2023. LAVA: Data Valuation without Pre-Specified Learning Algorithms. InThe Eleventh International Conference on Learning Representations. OpenReview
work page 2023
-
[19]
Adam Karczmarz, Tomasz Michalak, Anish Mukherjee, Piotr Sankowski, and Piotr Wygocki. 2022. Improved feature importance computation for tree models based on the Banzhaf value. InUncertainty in Artificial Intelligence. PMLR, 969– 979
work page 2022
-
[20]
Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. 2023. The UCI Machine Learning Repository. https://archive.ics.uci.edu/
work page 2023
-
[21]
Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. InProceedings of the 34th International Conference on Machine Learning. PMLR, 1885–1894
work page 2017
-
[22]
Yongchan Kwon and James Zou. 2022. Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning. InProceedings of The 25th International Conference on Artificial Intelligence and Statistics. PMLR, 8780–8802
work page 2022
-
[23]
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8424–8445
work page 2022
-
[24]
Weida Li and Yaoliang Yu. 2023. Robust data valuation with weighted banzhaf values.Advances in Neural Information Processing Systems36 (2023), 60349– 60383
work page 2023
-
[25]
Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. 2020. From local explanations to global understanding with explainable AI for trees. Nature machine intelligence2, 1 (2020), 56–67
work page 2020
-
[26]
Sasan Maleki, Long Tran-Thanh, Greg Hines, Talal Rahwan, and Alex Rogers
-
[27]
Bounding the Estimation Error of Sampling-based Shapley Value Approximation
Bounding the Estimation Error of Sampling-based Shapley Value Approxi- mation. arXiv:1306.4265 [cs.GT]
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Tomomi Matsui and Yasuko Matsui. 2000. A survey of algorithms for calculating power indices of weighted majority games.Journal of the Operations Research Society of Japan43, 1 (2000), 71–86
work page 2000
-
[29]
Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. 2022. Sampling permutations for shapley value estimation.Journal of Machine Learning Research 23, 43 (2022), 1–46
work page 2022
-
[30]
Kislaya Prasad and Jerry S Kelly. 1990. NP-completeness of some problems concerning voting games.International Journal of Game Theory19 (1990), 1–9
work page 1990
-
[31]
Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. 2020. Estimating Training Data Influence by Tracing Gradient Descent.Advances in Neural Information Processing Systems33 (2020), 19920–19930
work page 2020
-
[32]
Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. 2022. Interpretable machine learning: Fundamental principles and 10 grand challenges.Statistic Surveys16 (2022), 1–85
work page 2022
-
[33]
Andrea Schioppa, Polina Zablotskaia, David Vilar, and Artem Sokolov. 2022. Scaling Up Influence Functions.Proceedings of the AAAI Conference on Artificial Intelligence36, 8 (2022), 8179–8186
work page 2022
-
[34]
Lloyd S. Shapley. 1953. A Value for n-Person Games. InContributions to the Theory of Games, Volume II. Princeton University Press, Princeton, NJ, USA, 307–318
work page 1953
-
[35]
Rachael Hwee Ling Sim, Xinyi Xu, and Bryan Kian Hsiang Low. 2022. Data Valuation in Machine Learning: “Ingredients”, Strategies, and Open Challenges. InProceedings of the Thirty-First International Joint Conference on Artificial In- telligence, IJCAI-22. International Joint Conferences on Artificial Intelligence Organization, 5607–5614
work page 2022
-
[36]
Takeaki Uno. 2012. Efficient computation of power indices for weighted majority games. InAlgorithms and Computation: 23rd International Symposium, ISAAC 2012, Taipei, Taiwan, December 19-21, 2012. Proceedings 23. Springer, 679–689
work page 2012
-
[37]
Rene Van den Brink and Gerard Van der Laan. 1998. Axiomatizations of the normalized Banzhaf value and the Shapley value.Social Choice and Welfare15, 4 (1998), 567–582
work page 1998
-
[38]
van Rijn, Bernd Bischl, and Luis Torgo
Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: networked science in machine learning.SIGKDD Explorations15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198
-
[39]
Diego Varela and Javie Prado-Dominguez. 2012. Negotiating the Lisbon Treaty: Redistribution, Efficiency and Power Indices.AUCO Czech Economic Review6, 2 (2012)
work page 2012
-
[40]
Jiachen T. Wang and Ruoxi Jia. 2023. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. InProceedings of The 26th International Con- ference on Artificial Intelligence and Statistics. PMLR, 6388–6421
work page 2023
-
[41]
Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms
Jiachen T. Wang and Ruoxi Jia. 2023. A Note on “Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms”. arXiv:2304.04258 [stat.ML]
-
[42]
Wang, Prateek Mittal, and Ruoxi Jia
Jiachen T. Wang, Prateek Mittal, and Ruoxi Jia. 2024. Efficient Data Shapley for Weighted Nearest Neighbor Algorithms. InProceedings of The 27th International Conference on Artificial Intelligence and Statistics. PMLR, 2557–2565
work page 2024
-
[43]
Jiachen T Wang, Prateek Mittal, Dawn Song, and Ruoxi Jia. 2025. Data Shapley in One Training Run. InThe Thirteenth International Conference on Learning Representations. OpenReview.net
work page 2025
-
[44]
Tom Yan and Ariel D Procaccia. 2021. If You Like Shapley Then You’ll Love the Core.Proceedings of the AAAI Conference on Artificial Intelligence35, 6 (2021), 5751–5759
work page 2021
-
[45]
Jiayao Zhang, Qiheng Sun, Jinfei Liu, Li Xiong, Jian Pei, and Kui Ren. 2023. Efficient sampling approaches to shapley value approximation.Proceedings of the ACM on Management of Data1, 1 (2023), 1–24. 13 Algorithm A1:Specialized dynamic programming algorithm for unweighted𝑘NN Input:Dataset𝐷, test point𝑧 test, integer𝑘 1Sort𝑧 1, . . . , 𝑧𝑛 by non-decreasin...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.