
arxiv: 2605.10684 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Is Data Shapley Not Better than Random in Data Selection? Ask NASH

Pith reviewed 2026-05-13 05:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords data selection · Data Shapley · semivalues · non-linear aggregation · utility decomposition · machine learning

The pith

NASH improves data selection by decomposing the utility function into Shapley-informative components and aggregating them non-linearly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data Shapley can fail to pick better training subsets than random choice in some cases. The paper fixes this by breaking the target utility, such as validation accuracy, into simpler component functions where Shapley values remain reliable. These components are then combined through a non-linear objective that guides the final selection. The resulting subsets outperform both direct Shapley ranking and random sampling while adding only modest computation. A reader cares because the approach turns an unreliable valuation tool into a practical, low-overhead method for choosing high-quality training data.
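The failure mode can be made concrete with a small sketch. The game below is a hypothetical construction in the spirit of the paper's toy illustration (Figure 2), not its exact example: the role assignment, the role weights, and the names `ROLES`, `ROLE_WEIGHT`, `utility`, and `exact_shapley` are all assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical toy game (an assumption, not the paper's exact example):
# data 1 and 2 both cover role "a" (worth 2), datum 3 covers role "b" (worth 1).
ROLES = {1: "a", 2: "a", 3: "b"}
ROLE_WEIGHT = {"a": 2.0, "b": 1.0}


def utility(subset):
    """Total weight of the roles covered by the subset."""
    covered = {ROLES[i] for i in subset}
    return sum(ROLE_WEIGHT[r] for r in covered)


def exact_shapley(players):
    """Average each player's marginal contribution over all orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        chosen = []
        for p in order:
            before = utility(chosen)
            chosen.append(p)
            values[p] += utility(chosen) - before
    return {p: total / len(orderings) for p, total in values.items()}


shapley = exact_shapley([1, 2, 3])

# Marginal gains once datum 1 is already selected.
gain_of_2_after_1 = utility([1, 2]) - utility([1])
gain_of_3_after_1 = utility([1, 3]) - utility([1])

print(shapley, gain_of_2_after_1, gain_of_3_after_1)
```

In this sketch all three data receive the same Shapley value, so a top-m ranking cannot distinguish the subsets {1, 2} and {1, 3}, even though the second covers both roles and has strictly higher utility.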

Core claim

NASH decomposes the target utility function into simpler Shapley-informative component functions and selects data by optimizing an objective that aggregates these components non-linearly, substantially boosting the effectiveness of Shapley or semivalue-based data selection.

What carries the argument

The NASH framework: utility decomposition into Shapley-informative components followed by non-linear aggregation for subset selection.
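A minimal sketch of what decomposition followed by non-linear aggregation could look like is given here. It is not the paper's implementation: the objective form (a sum over components of a saturating exponential of the summed per-component Shapley values), the parameter `lam`, the greedy loop, and the matrix `phi` are assumptions chosen to match the page's description of an exponential aggregating function, a parameter λ, and greedy selection.

```python
import math


def aggregate(x, lam=1.0):
    """Assumed saturating exponential: gains shrink as a component fills up."""
    return 1.0 - math.exp(-lam * x)


def objective(subset, phi, lam=1.0):
    """Sum over components of the aggregated per-component Shapley totals."""
    n_components = len(phi[0])
    return sum(
        aggregate(sum(phi[i][v] for i in subset), lam)
        for v in range(n_components)
    )


def greedy_select(phi, m, lam=1.0):
    """Add, m times, the datum with the largest resulting objective value."""
    selected = []
    remaining = set(range(len(phi)))
    for _ in range(m):
        best = max(
            sorted(remaining),
            key=lambda i: objective(selected + [i], phi, lam),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def top_m_select(phi, m):
    """Baseline: rank by the summed (linear) per-component values."""
    totals = [sum(row) for row in phi]
    return sorted(range(len(phi)), key=lambda i: -totals[i])[:m]


# Hypothetical per-component values: rows are training data, columns are
# components (for example, one column per validation datum).
phi = [
    [1.0, 0.0],  # datum 0 helps component 0 only
    [1.0, 0.0],  # datum 1 duplicates datum 0
    [0.0, 0.9],  # datum 2 helps component 1 only
]

nash_like = greedy_select(phi, m=2)
top_m = top_m_select(phi, m=2)
print(nash_like, top_m)
```

Under these assumptions the greedy selection picks one datum per component, while the linear top-m ranking picks the two redundant data. The saturating aggregation is what makes the second copy of an already covered component less attractive.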

If this is right

  • Shapley-based selection produces consistently higher-quality subsets across different utility functions.
  • The added runtime cost remains small enough for large-scale data selection tasks.
  • The same decomposition-plus-aggregation pattern extends to other semivalues.
  • Models trained on the selected subsets reach higher performance with less data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decomposition idea may apply to other interaction-aware valuation methods beyond semivalues.
  • It could clarify which utility functions naturally admit Shapley-informative breakdowns.
  • The pattern might transfer to multi-objective settings where each component captures a different performance aspect.

Load-bearing premise

The target utility can be decomposed into simpler component functions that are each Shapley-informative, and their non-linear aggregation will consistently outperform direct Shapley or random selection.

What would settle it

Train models on subsets chosen by NASH, by direct top-m Shapley ranking, and by random sampling on the same dataset, then compare validation accuracy. If the NASH-selected subsets yield accuracy that is statistically lower than or equal to that of the baselines, the central claim fails.
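One way such a comparison could be scored is with a paired test over repeated runs. The sketch below uses an exact one-sided sign-flip test, which is a generic statistical technique swapped in for illustration and not something the paper prescribes. The accuracy lists are placeholder numbers, not results from the paper, and the function name `paired_sign_flip_pvalue` is hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """One-sided exact sign-flip test for mean(scores_a - scores_b) > 0."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    count = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            count += 1
    return count / total


# Placeholder validation accuracies over five paired runs (made-up numbers).
selected_acc = [0.84, 0.85, 0.83, 0.86, 0.85]
baseline_acc = [0.81, 0.82, 0.82, 0.83, 0.80]

p_value = paired_sign_flip_pvalue(selected_acc, baseline_acc)
print(p_value)
```

The enumeration is exponential in the number of runs, so this form only suits a handful of paired repetitions. A small p-value would support the claim, while a large one would point toward "lower or equal".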

Figures

Figures reproduced from arXiv: 2605.10684 by Bryan Kian Hsiang Low, Jue Fan, Nancy F. Chen, Rachael Hwee Ling Sim, Xiao Tian, Zixuan Wang.

Figure 1: Overview of NASH data selection framework. When the utility function (e.g., validation accuracy u_V) is complex and involves different roles of data (e.g., 3 roles in the figure), Shapley values fail to inform on data with different strengths (e.g., different columns of data in the figure), and thus may perform badly. To address this, our NASH first decomposes u_V to simpler, fewer-role components where Sha…
Figure 2: Complex utility functions involving multiple roles (e.g., u_V) are not Shapley-informative. In this toy illustration, all data have the same Shapley value. Yet, if 1 has been selected, 3 would contribute more than 2 as role b is uncovered. … which correspond to different sets of roles and require training data with different strengths (e.g., from every class or subpopulation). This raises two serious problems…
Figure 3: Illustration of Shapley-informativeness. 3a shows that validation accuracy is not Shapley-informative since the same sum of Shapley values û can correspond to a range of actual utility u; 3b shows the distribution of marginal contributions when two data with different goodness (can be seen from their Shapley values) join different subsets, and they both demonstrate consistency; 3c shows that prediction co…
Figure 4: Qualitative analysis of subsets selected by Data Shapley and NASH. We use a 40% subset of our PO-LR setting (see Sec. 4) and show histograms of û_v(M) on different validation data v ∈ V or roles. … datum would have different contributions to different sets depending on the strengths of data in these sets
Figure 6: Data selection performance using alternative utility functions. Fig. 6a uses validation loss; Fig. 6b and 6c consider regression tasks and use (negated) mean squared error given by RR as the utility function. NASH still consistently outperforms other baselines. … 4.2 Heterogeneous-Quality Datasets. In Sec. 3.2, we note that prior works (Ghorbani & Zou, 2019; Wang et al., 2024c) demonstrate that Data Shapley w…
Figure 5: General data selection performance. Different ratios of training data are selected and evaluated using validation accuracy. NASH consistently outperforms other baselines. … does NASH solve this issue (we have argued that NASH is effective in this setting too as it solves P1 and P2)? The results are shown in
Figure 8: NASH is compatible with other semivalues. While other semivalues demonstrate ineffective performances similar to Data Shapley, NASH consistently improves over them. … Data selection via top-m heuristic. Top-m heuristic (Sec. 1) is commonly adopted in data selection beyond Data Shapley and semivalues. For example, influence-based methods (Koh & Liang, 2017; Pruthi et al., 2020) are widely used to quantify th…
Figure 7: Data selection performance on MR-BT and MP-BT when label noise is added to the training set. As the amount of noise increases, NASH consistently outperforms Data Shapley although Data Shapley improves. …teractions such as LOO, even when paired with NASH. Secondly, we discover that when the selection size is very small, semivalues that assign larger weights to smaller coalitions such as Beta(4, 1) could h…
Figure 9: Paired comparison of data i and j's Shapley values ϕ_i(u_v) and ϕ_j(u_v) on each validation datum v ∈ V. Specifically, i and j are taken from the PO-LR setting, both with Data Shapley values ϕ_i(u_V) = ϕ_j(u_V) = 3.6×10^{-3}. Each scatter point corresponds to a validation datum v's (ϕ_i(u_v), ϕ_j(u_v)). We also plot the line y = x (red dashed line). … C.4 Empirical Justification of Consistent Player (Theorem 3.2) Th…
Figure 10: Comparison of the objective value given by greedily selected subsets vs. randomly selected subsets. The greedy algorithm consistently gives a much higher objective value. … To gain more insights on why the greedy algorithm performs well for the NASH algorithm, we investigate the source of non-submodularity. Clearly, a large number of negative Shapley values come from the noisy/low-quality data with negative…
Figure 11: Data selection performance. Different ratios of training data are selected and evaluated using validation accuracy. NASH in general outperforms other baselines. … Most of our experiments focus on classification problems with validation accuracy as the utility function. Nonetheless, NASH can naturally extend to other problems and is compatible with other utility functions. For example, NASH would prefer data…
Figure 12: NASH improves the performance with other utility functions u. In 12a-12c, we consider regression tasks using the (negated) MSE given by the RR model as the utility function. In 12d and 12e, (negated) loss is used as the utility function. [Plot residue from an adjacent figure: x-axis selected ratio (%), y-axis validation accuracy; series Rand, Shap, Nash; panels (a) WD-LR 0% flip, (b) WD-LR 10% fli…]
Figure 13: Data selection on heterogeneous datasets. Different proportions of train labels are flipped to create heterogeneous datasets. Different ratios of training data are selected and evaluated using validation accuracy. Shapley-based data selection performs better than random when datasets are more heterogeneous. NASH consistently improves over Shapley-based data selection. … D.3 Heterogeneous-Quality Datasets Da…
Figure 14: NASH improves the performance over other semivalues. … D.5 Ablation Studies. D.5.1 Choice of F_T in the NASH objective. In this section, we compare across different choices of aggregating function F_T and justify our choice of the exponential form in the main paper. We consider 3 non-linear functions inspired by the learning curves, including the exponential law (Exp), power law (Pow) and logarithmic law (Log) …
Figure 15: Comparison across different choices of F_T. The exponential form (in the main paper) consistently gives the best result. [Plot residue from an adjacent figure: x-axis selected ratio (%), y-axis validation accuracy; series Shap, Banz, LOO, Beta(16, 1), Beta(4, 1), Beta(1, 4), Beta(1, 16); panels (a) WD-LR, (b) PO-LR, …]
Figure 16: Comparison across different choices of semivalues. … selection ratio is small. However, as the selection ratio gets larger, these values' performances drop because they fail to consider the long-term interactions among data. On the other hand, semivalues that put larger weights on larger coalitions, such as Beta(1, 16) and Beta(1, 4), do not in contrast result in a better performance when the selection rati…
Figure 17: NASH works well for a wide range of λ's (i.e., different numbers in the legends).
Original abstract

Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
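The top-$m$ baseline that the abstract refers to can be sketched as follows. This is a generic permutation-sampling estimator of Shapley values followed by a top-m cut, not code from the paper. The function names, the `num_permutations` and `seed` parameters, and the additive toy utility are assumptions; in a real setting `utility` would train a model on the subset and return a score such as validation accuracy.

```python
import random


def monte_carlo_shapley(n, utility, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        chosen = []
        previous = utility(chosen)
        for i in order:
            chosen.append(i)
            current = utility(chosen)
            values[i] += current - previous
            previous = current
    return [v / num_permutations for v in values]


def top_m(values, m):
    """Indices of the m largest values (ties broken by index)."""
    return sorted(range(len(values)), key=lambda i: (-values[i], i))[:m]


# Hypothetical additive utility: each datum contributes a fixed weight.
weights = [0.5, 2.0, 1.0, 0.25]
estimates = monte_carlo_shapley(
    len(weights), lambda subset: sum(weights[i] for i in subset)
)
chosen = top_m(estimates, m=2)
print(estimates, chosen)
```

For an additive utility like this one, every marginal gain equals the datum's own weight, so the top-m cut is the right choice. The concern raised in the abstract arises when the utility is not additive and the ranking ignores how the selected data interact.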

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that Data Shapley and other semivalues can sometimes perform no better than random for data selection. It identifies 'Shapley-informative' settings where they work well and proposes the NASH framework, which decomposes the target utility function (such as validation accuracy) into simpler Shapley-informative component functions and selects data subsets by optimizing a non-linear aggregation of those components. The authors state that NASH substantially improves the effectiveness of Shapley/semivalue-based selection at minimal extra runtime cost.

Significance. If the empirical gains are robust, this work could meaningfully advance practical data selection in machine learning by providing a targeted way to overcome the known failure modes of raw Shapley values through decomposition and non-linear aggregation. The low additional runtime cost is a practical strength, and the algorithmic construction avoids circularity or parameter-fitting issues. It directly engages with limitations in the semivalue literature and could influence data curation pipelines if the decomposition strategy generalizes.

minor comments (2)
  1. The abstract states performance claims without referencing specific datasets, tasks, or quantitative improvements; adding one sentence on experimental scope would improve readability.
  2. Clarify the precise definition and identification procedure for 'Shapley-informative' component functions in the main text, as this is central to the framework's reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript and the recommendation for minor revision. We appreciate the recognition that NASH provides a targeted approach to overcome known limitations of semivalue-based data selection through decomposition and non-linear aggregation, while maintaining low additional runtime cost.

Circularity Check

0 steps flagged

No significant circularity in NASH framework

full rationale

The paper presents NASH as an algorithmic construction: it decomposes a target utility (e.g., validation accuracy) into simpler Shapley-informative component functions and then aggregates them non-linearly to select data. No load-bearing derivation, equation, or prediction is shown that reduces by construction to its own inputs, fitted parameters, or self-citations. The abstract and description frame the contribution as an empirical algorithmic improvement over raw Shapley selection, with no self-definitional steps, uniqueness theorems imported from the authors, or ansatzes smuggled via citation. The approach directly addresses a known practical failure mode without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that suitable Shapley-informative component functions exist for typical validation utilities.

pith-pipeline@v0.9.0 · 5510 in / 1170 out tokens · 16855 ms · 2026-05-13T05:55:57.681093+00:00 · methodology



Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 1 internal anchor

  1. [1] Bentivogli, Luisa and Clark, Peter and Dagan, Ido and Giampiccolo, Danilo. The Fifth
  2. [2] Bernstein, Sergei. On a modification of
  3. [3] A Note on Regular Semivalues. International Game Theory Review, 2000.
  4. [4] Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. arXiv preprint arXiv:1102.3975.
  5. [5] Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 2004.
  6. [6] Value theory without efficiency. Mathematics of Operations Research, 1981.
  7. [7] The bargaining problem. Econometrica.
  8. [8] Nemhauser, George L and Wolsey, Laurence A and Fisher, Marshall L. An analysis of approximations for maximizing submodular set functions—. 1978.
  9. [9] Convex approximations of chance constrained programs. SIAM Journal on Optimization, 2007.
  10. [10] Robbins, Herbert. A remark on. 1955.
  11. [11] Shapley, L. S. A Value for
  12. [12] Tang, Siyi and Ghorbani, Amirata and Yamashita, Rikiya and Rehman, Sameer and Dunnmon, Jared A and Zou, James and Rubin, Daniel L. Data valuation for medical imaging using. 2021.
  13. [13] DeRDaVa: Deletion-Robust Data Valuation for Machine Learning. Proc. AAAI, 2024. doi:10.1609/aaai.v38i14.29462
  14. [14] The shape of learning curves: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  15. [15] Calibrating probability with undersampling for unbalanced classification. 2015 IEEE symposium series on computational intelligence, 2015.
  16. [16] Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina.
  17. [17] Automatically constructing a corpus of sentential paraphrases. Proceedings of the third international workshop on paraphrasing (IWP2005).
  18. [18] Ghorbani, Amirata and Zou, James. 2019.
  19. [19] Most influential subset selection: Challenges, promises, and beyond. Proc. NeurIPS.
  20. [20] Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Hynes, Nick and G. Towards efficient data valuation based on the. Proc. AISTATS, 2019.
  21. [21] Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 2019.
  22. [22] Jiang, Kevin and Liang, Weixin and Zou, James Y and Kwon, Yongchan.
  23. [23] Understanding black-box predictions via influence functions. Proc. ICML, 2017.
  24. [24] Kwon, Yongchan and Zou, James. 2022.
  25. [25] Faster approximation of probabilistic and distributional values via least squares. Proc. ICLR.
  26. [26] One Sample Fits All: Approximating All Probabilistic Values Simultaneously and Efficiently. Proc. NeurIPS.
  27. [27] Coresets for data-efficient training of machine learning models. Proc. ICML, 2020.
  28. [28] Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. Proc. ACL.
  29. [29] Estimating Training Data Influence by Tracing Gradient Descent. Proc. NeurIPS.
  30. [30] A generalized representer theorem. COLT, 2001.
  31. [31] Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges. Proc. IJCAI.
  32. [32] Recursive deep models for semantic compositionality over a sentiment treebank. Proceedings of the 2013 conference on empirical methods in natural language processing.
  33. [33] Performance guarantees for greedy maximization of non-submodular controllability metrics. 2019 18th European Control Conference (ECC), 2019.
  34. [34] Subset Selection in Machine Learning: From Theory to Applications. Proc. ICML 2021 Workshop on SubSetML.
  35. [35] Sundararajan, Mukund and Dhamdhere, Kedar and Agarwal, Ashish. The. 2020.
  36. [36] Wang, Jiachen T and Jia, Ruoxi. 2023.
  37. [37] Wang, Jiachen T and Yang, Tianji and Zou, James and Kwon, Yongchan and Jia, Ruoxi. Rethinking. 2024.
  38. [38] Wang, Jingtan and Lin, Xiaoqiang and Qiao, Rui and Foo, Chuan-Sheng and Low, Bryan Kian Hsiang. Helpful or Harmful Data? Fine-tuning-free. 2024.
  39. [39] Wang, Jiachen T and Mittal, Prateek and Song, Dawn and Jia, Ruoxi.
  40. [40] Submodularity in data subset selection and active learning. Proc. ICML, 2015.
  41. [41] Representer point selection for explaining deep neural networks. Proc. NeurIPS.
  42. [42] A Survey on Data Selection for Language Models. 2024.
  43. [43] Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making. 2025.
  44. [44] Provably Adaptive Linear Approximation for the Shapley Value and Beyond. 2026.
  45. [45] Jiachen T. Wang and Yuqing Zhu and Yu-Xiang Wang and Ruoxi Jia and Prateek Mittal. Threshold. eprint 2308.15709.
  46. [46] Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller ... Llama 2: Open Foundation and Fine-Tuned Chat Models.
  47. [47] Advancing Data Selection for Foundation Models: From Heuristics to Principled Methods. 2024.
  48. [48] What is percolation? Percolation, 2012.
  49. [49] Vanschoren, Joaquin. Wind
  50. [50] Vanschoren, Joaquin. Pol
  51. [51] Vanschoren, Joaquin. CPU
  52. [52] Vanschoren, Joaquin. 2dplanes
  53. [53] Gijsbers, Pieter. APSFailure
  54. [54] Gazioglu, Mine. California Housing
  55. [55] Fischer, Sebastian. Physiochemical Protein
  56. [56] Fischer, Sebastian. Auction Verification