Data Shapley: Equitable Valuation of Data for Machine Learning

Amirata Ghorbani; James Zou

arxiv: 1904.02868 · v2 · pith:C3KCFWXInew · submitted 2019-04-05 · stat.ML · cs.AI · cs.LG

classification stat.ML cs.AI cs.LG

keywords data shapley learning value equitable valuation predictor what

Abstract

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley value uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.
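The Monte Carlo estimation mentioned in the abstract can be illustrated with a minimal sketch, assuming a permutation-sampling scheme: draw random orderings of the training points, add points one at a time, and average each point's marginal change in test performance. This is not the authors' implementation. The names `nearest_centroid_accuracy` and `monte_carlo_shapley`, the `tol` truncation threshold, the 0.5 score assigned to an empty training set, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Hypothetical stand-in learner and metric: nearest-class-centroid accuracy."""
    if len(y_train) == 0:
        # Assumption: with no training data, score as chance on a balanced binary task.
        return 0.5
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test point to every class centroid.
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_test).mean())


def monte_carlo_shapley(X, y, X_test, y_test, score_fn, n_perms=200, tol=0.0, seed=0):
    """Estimate a per-point value by averaging marginal contributions over random orderings."""
    rng = np.random.default_rng(seed)
    n = len(y)
    values = np.zeros(n)
    empty_score = score_fn(X[:0], y[:0], X_test, y_test)
    full_score = score_fn(X, y, X_test, y_test)
    for t in range(1, n_perms + 1):
        perm = rng.permutation(n)
        marginals = np.zeros(n)
        prev = empty_score
        for k, idx in enumerate(perm, start=1):
            # Optional truncation: once the running score is within tol of the
            # full-data score, treat the remaining marginals as zero.
            if abs(full_score - prev) < tol:
                break
            subset = perm[:k]
            cur = score_fn(X[subset], y[subset], X_test, y_test)
            marginals[idx] = cur - prev
            prev = cur
        # Running mean of the marginal contributions across permutations.
        values += (marginals - values) / t
    return values


# Synthetic two-class data with one deliberately corrupted training label.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1, 1, (10, 2)), rng.normal(1, 1, (10, 2))])
y_train = np.array([0] * 10 + [1] * 10)
y_train[0] = 1  # corrupted label
X_test = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y_test = np.array([0] * 50 + [1] * 50)

values = monte_carlo_shapley(
    X_train, y_train, X_test, y_test, nearest_centroid_accuracy, n_perms=50
)
print(values.round(3))
```

With `tol=0.0` no truncation occurs, so the marginals within each permutation telescope and the estimated values sum to the full-data score minus the empty-set score. A positive `tol` trades some of that exactness for fewer retraining calls.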

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Residual Feature Integration is Sufficient to Prevent Negative Transfer
cs.LG · 2025-05 · unverdicted · novelty 7.0

Residual feature integration with a trainable target-side encoder provably prevents negative transfer, achieving convergence rates no worse than training from scratch under informative target distributions.

Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
cs.LG · 2026-04 · unverdicted · novelty 6.0

RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces
cs.DB · 2026-05 · unverdicted · novelty 5.0

CHRONOS is a three-layer system for evolving data marketplaces that applies neural-ODE temporal decay, changepoint-aware Shapley valuation, and EXP3-IX private coordination to achieve 0.937 recall, 2.74 qps, 161 ms la...