Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values
Pith reviewed 2026-05-09 20:20 UTC · model grok-4.3
The pith
K-Shapley value uniquely satisfies four axioms and supports an algorithm with O(T^{3/4}) fairness regret in budgeted combinatorial bandits under full-bandit feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We extend the Shapley value to the K-Shapley value, which captures the marginal contribution of an agent restricted to a set of size at most K. We show that K-Shapley value is a unique solution concept that satisfies Symmetry, Linearity, Null player, and efficiency properties. We next propose K-SVFair-FBF, a fairness-aware bandit algorithm that adaptively estimates K-Shapley value with unknown valuation function. Unlike standard bandit literature on full bandit feedback, K-SVFair-FBF not only learns the valuation function under full feedback setting but also mitigates the noise arising from Monte Carlo approximations. Theoretically, we prove that K-SVFair-FBF achieves O(T^{3/4}) regret bound
What carries the argument
K-Shapley value, the average marginal contribution of an arm computed only over coalitions of size at most K, which serves as the merit measure for fair arm selection.
If this is right
- K-Shapley value is the unique concept satisfying symmetry, linearity, null-player, and efficiency axioms.
- The algorithm simultaneously learns the valuation function and keeps Monte Carlo noise from breaking fairness.
- Fairness regret is bounded by O(T^{3/4}), which is sublinear.
- The method is demonstrated on federated learning and social influence maximization tasks.
Where Pith is reading between the lines
- The same axiomatic construction could be used to define fairness in other online allocation problems that lack per-arm observations.
- Noise control techniques developed here may stabilize other cooperative-game allocations when only aggregate rewards are available.
- Because the setting uses stricter full-bandit feedback than semi-bandit models, the approach may extend to environments with even less information.
Load-bearing premise
The unknown valuation function can be adaptively estimated from full-bandit observations while the Monte Carlo noise in K-Shapley approximation can be controlled without destroying the fairness guarantee.
What would settle it
A controlled simulation with known valuations where the algorithm's fairness regret grows linearly in T or where the computed K-Shapley values produce selections that violate one of the four axioms.
Figures
read the original abstract
We propose a new framework for meritocratic fairness in budgeted combinatorial multi-armed bandits with full-bandit feedback (BCMAB-FBF). Unlike semi-bandit feedback, the contribution of individual arms is not received in full-bandit feedback, making the setting significantly more challenging. To compute arm contributions in BCMAB-FBF, we first extend the Shapley value, a classical solution concept from cooperative game theory, to the $K$-Shapley value, which captures the marginal contribution of an agent restricted to a set of size at most $K$. We show that $K$-Shapley value is a unique solution concept that satisfies Symmetry, Linearity, Null player, and efficiency properties. We next propose K-SVFair-FBF, a fairness-aware bandit algorithm that adaptively estimates $K$-Shapley value with unknown valuation function. Unlike standard bandit literature on full bandit feedback, K-SVFair-FBF not only learns the valuation function under full feedback setting but also mitigates the noise arising from Monte Carlo approximations. Theoretically, we prove that K-SVFair-FBF achieves $O(T^{3/4})$ regret bound on fairness regret. Through experiments on federated learning and social influence maximization datasets, we demonstrate that our approach achieves fairness and performs more effectively than existing baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the K-Shapley value as an extension of the Shapley value to capture marginal contributions restricted to coalitions of size at most K. It proves uniqueness of this value under the axioms of Symmetry, Linearity, Null player, and Efficiency. The authors then propose the K-SVFair-FBF algorithm for budgeted combinatorial multi-armed bandits under full-bandit feedback, which adaptively estimates the unknown valuation function and the K-Shapley values while mitigating Monte Carlo approximation noise. They claim that this yields an O(T^{3/4}) bound on fairness regret and demonstrate the approach on federated learning and social influence maximization tasks.
Significance. If the regret analysis rigorously controls the Monte Carlo estimation error under full-bandit feedback, the result would provide a novel fairness mechanism for combinatorial bandits with limited observability. The axiomatic characterization of K-Shapley and the explicit handling of approximation noise in the algorithm are potentially useful contributions to meritocratic fairness in sequential decision-making.
major comments (3)
- [Theoretical analysis / regret theorem] Theoretical analysis section (around the regret theorem): The O(T^{3/4}) fairness regret bound is stated for K-SVFair-FBF, but the manuscript does not provide an explicit derivation or lemma bounding the additional error term induced by finite-sample Monte Carlo estimation of the K-Shapley values. Under full-bandit feedback only the scalar reward of the chosen subset is observed, so each round supplies limited information about individual marginal contributions; without a concrete concentration inequality or sample-complexity argument showing that the MC variance is absorbed into the T^{3/4} rate (e.g., when the number of samples grows only polylogarithmically), the claimed guarantee does not follow.
- [K-Shapley definition and uniqueness] Definition and uniqueness of K-Shapley (early theoretical section): The uniqueness claim under the four listed axioms is asserted, yet no proof sketch, inductive argument, or reference to the standard Shapley uniqueness proof adapted to the size-K restriction is supplied. The manuscript should either include the key steps or clarify the precise assumptions on the valuation function v (e.g., whether it is monotone, submodular, or merely any set function) that are required for the axiomatic characterization to hold.
- [K-SVFair-FBF algorithm] Algorithm description and estimation procedure: The adaptive estimation of the unknown valuation function from full-bandit observations is described at a high level, but the manuscript does not specify how the Monte Carlo samples for marginal contributions are drawn or how the resulting estimator is plugged into the fairness constraint without violating the O(T^{3/4}) bound. A concrete description of the sampling distribution and the bias-variance trade-off is needed.
minor comments (2)
- [Experiments] Experiments section: Quantitative results, confidence intervals, and explicit baseline comparisons (including standard combinatorial bandit algorithms and fairness-agnostic variants) are missing; only qualitative statements are provided.
- [Notation throughout] Notation: The distinction between the true K-Shapley value and its Monte Carlo estimate is not consistently denoted throughout the regret analysis, making it difficult to track error propagation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the current manuscript would benefit from expanded details on the regret analysis, uniqueness proof, and algorithm. We will revise accordingly and respond point by point below.
read point-by-point responses
-
Referee: [Theoretical analysis / regret theorem] Theoretical analysis section (around the regret theorem): The O(T^{3/4}) fairness regret bound is stated for K-SVFair-FBF, but the manuscript does not provide an explicit derivation or lemma bounding the additional error term induced by finite-sample Monte Carlo estimation of the K-Shapley values. Under full-bandit feedback only the scalar reward of the chosen subset is observed, so each round supplies limited information about individual marginal contributions; without a concrete concentration inequality or sample-complexity argument showing that the MC variance is absorbed into the T^{3/4} rate (e.g., when the number of samples grows only polylogarithmically), the claimed guarantee does not follow.
Authors: We acknowledge that an explicit lemma bounding the Monte Carlo error is missing from the current version. In the revision we will add Lemma 5 (or similar) that applies Hoeffding's inequality to the MC estimator for each K-Shapley value. With O(log T) samples per estimation phase and reuse of past observations, the per-round additive error is O(T^{-1/4}), which integrates into the existing O(T^{3/4}) fairness-regret analysis without altering the rate. This accounts for the scalar observation per round under full-bandit feedback. revision: yes
-
Referee: [K-Shapley definition and uniqueness] Definition and uniqueness of K-Shapley (early theoretical section): The uniqueness claim under the four listed axioms is asserted, yet no proof sketch, inductive argument, or reference to the standard Shapley uniqueness proof adapted to the size-K restriction is supplied. The manuscript should either include the key steps or clarify the precise assumptions on the valuation function v (e.g., whether it is monotone, submodular, or merely any set function) that are required for the axiomatic characterization to hold.
Authors: We will insert a concise proof sketch (new Theorem 2 or appendix) that adapts the classical Shapley uniqueness argument to coalitions of size at most K. The argument proceeds by showing that any value satisfying the four axioms must coincide with the K-Shapley formula, using linearity to decompose v and symmetry/null-player to fix the weights. The characterization holds for arbitrary set functions v; no monotonicity or submodularity is required. We will state this explicitly. revision: yes
-
Referee: [K-SVFair-FBF algorithm] Algorithm description and estimation procedure: The adaptive estimation of the unknown valuation function from full-bandit observations is described at a high level, but the manuscript does not specify how the Monte Carlo samples for marginal contributions are drawn or how the resulting estimator is plugged into the fairness constraint without violating the O(T^{3/4}) bound. A concrete description of the sampling distribution and the bias-variance trade-off is needed.
Authors: We will expand the algorithm section (and pseudocode) to specify that, at each round t, we draw M = O(log T) random subsets S of size at most K by including each arm independently with probability 1/2, then estimate marginal contributions from the single observed full-bandit reward of the played action. The resulting estimator is inserted into the fairness constraint via a projected update whose step-size is tuned to keep the bias-variance contribution inside the O(T^{3/4}) envelope. A short bias-variance lemma will be added to justify the choice of M. revision: yes
Circularity Check
No significant circularity: K-Shapley defined axiomatically; regret bound is independent analysis
full rationale
The derivation begins with an explicit extension of the classical Shapley value to K-Shapley via marginal contributions over coalitions of size at most K, followed by a direct proof that this satisfies the four standard axioms and is unique under them. This is a self-contained axiomatic characterization, not a self-definition or renaming. The K-SVFair-FBF algorithm is then constructed to estimate the unknown valuation function from full-bandit observations while bounding Monte Carlo approximation error; the O(T^{3/4}) fairness-regret guarantee is presented as a new concentration analysis that absorbs the approximation noise rather than reducing to a fitted parameter or prior self-citation. No load-bearing self-citations, ansatzes smuggled via citation, or fitted-input-called-prediction patterns appear in the central claims. The paper is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption K-Shapley value satisfies Symmetry, Linearity, Null player, and efficiency
invented entities (1)
-
K-Shapley value
no independent evidence
Reference graph
Works this paper leans on
-
[1]
[Albizuriet al., 2022 ] M Josune Albizuri, Satoshi Masuya, and Jos´e M Zarzuelo. Characterization of a value for games under restricted cooperation.Annals of Operations Re- search, 318(2):773–785,
work page 2022
-
[2]
[Amiet al., 2025 ] Dan Ben Ami, Kobi Cohen, and Qing Zhao. Client selection for generalization in accelerated federated learning: A multi-armed bandit approach.IEEE Access, 13:33697–33713,
work page 2025
-
[3]
[Anet al., 2019 ] Qingxian An, Yao Wen, Tao Ding, and Yongli Li. Resource sharing and payoff allocation in a three-stage system: Integrating network dea with the shap- ley value method.Omega, 85:16–25,
work page 2019
-
[4]
Regret in online combinatorial optimization.Mathematics of Operations Research, 39(1):31–45,
[Audibertet al., 2014 ] Jean-Yves Audibert, S ´ebastien Bubeck, and G´abor Lugosi. Regret in online combinatorial optimization.Mathematics of Operations Research, 39(1):31–45,
work page 2014
-
[5]
Combinatorial multi-armed bandit: General framework and applications
[Chenet al., 2013 ] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. InInternational conference on machine learn- ing, pages 151–159. PMLR,
work page 2013
-
[6]
Merit-based fair combinatorial semi-bandit with unrestricted feedback delays
[Chenet al., 2024 ] Ziqun Chen, Kechao Cai, Zhuoyue Chen, Jinbei Zhang, and John Lui. Merit-based fair combinatorial semi-bandit with unrestricted feedback delays. InECAI 2024, pages 2058–2065. IOS Press,
work page 2024
-
[7]
[Derks and Peters, 1993] Jean Derks and Hans Peters. A shapley value for games with restricted coalitions.Interna- tional Journal of Game Theory, 21(4):351–360,
work page 1993
-
[8]
[Fouratiet al., 2023 ] Fares Fourati, Vaneet Aggarwal, Christopher Quinn, and Mohamed-Slim Alouini. Ran- domized greedy learning for non-monotone stochastic submodular maximization under full-bandit feedback. In International Conference on Artificial Intelligence and Statistics, pages 7455–7471. PMLR,
work page 2023
-
[9]
[Fouratiet al., 2024 ] Fares Fourati, Christopher John Quinn, Mohamed-Slim Alouini, and Vaneet Aggarwal. Combina- torial stochastic-greedy bandit.Proceedings of the AAAI Conference on Artificial Intelligence, 38(11):12052–12060, Mar
work page 2024
-
[10]
[Fryeet al., 2020 ] Christopher Frye, Colin Rowat, and Ilya Feige. Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability.Advances in neural information processing systems, 33:1229–1239,
work page 2020
-
[11]
Dependent rounding and its applications to approximation algorithms
[Gandhiet al., 2006 ] Rajiv Gandhi, Samir Khuller, Srini- vasan Parthasarathy, and Aravind Srinivasan. Dependent rounding and its applications to approximation algorithms. Journal of the ACM (JACM), 53(3):324–360,
work page 2006
-
[12]
Data shapley: Equitable valuation of data for machine learn- ing
[Ghorbani and Zou, 2019] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learn- ing. InInternational conference on machine learning, pages 2242–2251. PMLR,
work page 2019
-
[13]
Improving fairness in adaptive social exergames via shapley bandits
[Grayet al., 2023 ] Robert C Gray, Jennifer Villareale, Thomas Boyd Fox, Diane H Dallal, Santiago Onta ˜n´on, Danielle Arigo, Shahin Jabbari, and Jichen Zhu. Improving fairness in adaptive social exergames via shapley bandits. InProceedings of the 28th International Conference on Intelligent User Interfaces, pages 322–336,
work page 2023
-
[14]
[Guillaume, 2008] L Guillaume. Fast unfolding of commu- nities in large networks.Journal Statistical Mechanics: Theory and Experiment, 10:P1008,
work page 2008
-
[15]
[Huanget al., 2020 ] Zhiming Huang, Yifan Xu, Bingshan Hu, Qipeng Wang, and Jianping Pan. Thompson sampling for combinatorial semi-bandits with sleeping arms and long- term fairness constraints.arXiv preprint arXiv:2005.06725,
-
[16]
Resource allocation using shapley value in lte networks
[Iturraldeet al., 2011 ] Mauricio Iturralde, Tara Ali Yahiya, Anne Wei, and Andr ´e-Luc Beylot. Resource allocation using shapley value in lte networks. In2011 IEEE 22nd International Symposium on Personal, Indoor and Mobile Radio Communications, pages 31–35. IEEE,
work page 2011
-
[17]
Identifying top-k players in cooperative games via shapley bandits
[Kolpaczkiet al., 2021 ] Patrick Kolpaczki, Viktor Bengs, and Eyke H¨ullermeier. Identifying top-k players in cooperative games via shapley bandits. InLWDA, pages 133–144,
work page 2021
-
[18]
[Leskovec and Mcauley, 2012] Jure Leskovec and Julian Mcauley. Learning to discover social circles in ego net- works.Advances in neural information processing systems, 25,
work page 2012
-
[19]
[Liet al., 2019 ] Fengjiao Li, Jia Liu, and Bo Ji. Combinato- rial sleeping bandits with fairness constraints.IEEE Trans- actions on Network Science and Engineering, 7(3):1799– 1813,
work page 2019
-
[20]
[Maschleret al., 2020 ] Michael Maschler, Shmuel Zamir, and Eilon Solan.Game theory. Cambridge University Press,
work page 2020
-
[21]
[Militanoet al., 2016 ] Leonardo Militano, Michele Nitti, Luigi Atzori, and Antonio Iera. Enhancing the naviga- bility in a social network of smart objects: A shapley-value based approach.Computer Networks, 103:1–14,
work page 2016
-
[22]
Game of gradients: Mitigating ir- relevant clients in federated learning
[Nagalapatti and Narayanam, 2021] Lokesh Nagalapatti and Ramasuri Narayanam. Game of gradients: Mitigating ir- relevant clients in federated learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9046–9054,
work page 2021
-
[23]
[Narahari, 2014] Yadati Narahari.Game theory and mecha- nism design, volume
work page 2014
-
[24]
[Narayanam and Narahari, 2010] Ramasuri Narayanam and Yadati Narahari. A shapley value-based approach to dis- cover influential nodes in social networks.IEEE trans- actions on automation science and engineering, 8(1):130– 147,
work page 2010
-
[25]
[Nguyenet al., 2022 ] Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Trade-off between payoff and model rewards in shapley-fair collabo- rative machine learning.Advances in Neural Information Processing Systems, 35:30542–30553,
work page 2022
-
[26]
Online learning via offline greedy algo- rithms: Applications in market design and optimization
[Niazadehet al., 2021 ] Rad Niazadeh, Negin Golrezaei, Joshua R Wang, Fransisca Susan, and Ashwinkumar Badanidiyuru. Online learning via offline greedy algo- rithms: Applications in market design and optimization. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 737–738,
work page 2021
-
[27]
An explore-then-commit algorithm for submodular maximization under full-bandit feedback
[Nieet al., 2022 ] Guanyu Nie, Mridul Agarwal, Ab- hishek Kumar Umrawal, Vaneet Aggarwal, and Christo- pher John Quinn. An explore-then-commit algorithm for submodular maximization under full-bandit feedback. In James Cussens and Kun Zhang, editors,Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, volume 180 ofProceedi...
work page 2022
-
[28]
Anytime fairness guarantees in stochastic combinatorial mabs: A novel learn- ing framework
[Pokhriyalet al., 2025 ] Subham Pokhriyal, Shweta Jain, Ganesh Ghalme, and Vaneet Aggarwal. Anytime fairness guarantees in stochastic combinatorial mabs: A novel learn- ing framework. InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 1660–1669,
work page 2025
-
[29]
[Reinhardt and Dada, 2005] Gilles Reinhardt and Maqbool Dada. Allocating the gains from resource pooling with the shapley value.Journal of the Operational Research Society, 56(8):997–1000,
work page 2005
-
[30]
Top-k combinatorial bandits with full-bandit feedback
[Rejwan and Mansour, 2020] Idan Rejwan and Yishay Man- sour. Top-k combinatorial bandits with full-bandit feedback. InAlgorithmic Learning Theory, pages 752–776. PMLR,
work page 2020
-
[31]
Some aspects of the se- quential design of experiments
[Robbins, 1952] Herbert Robbins. Some aspects of the se- quential design of experiments
work page 1952
-
[32]
[Shapley and others, 1953] Lloyd S Shapley et al. A value for n-person games
work page 1953
-
[33]
[Shapley, 1971] Lloyd S Shapley. Cores of convex games. International journal of game theory, 1:11–26,
work page 1971
-
[34]
Shapleyfl: Robust federated learning based on shapley value
[Sunet al., 2023 ] Qiheng Sun, Xiang Li, Jiayao Zhang, Li Xiong, Weiran Liu, Jinfei Liu, Zhan Qin, and Kui Ren. Shapleyfl: Robust federated learning based on shapley value. InProceedings of the 29th ACM SIGKDD Con- ference on Knowledge Discovery and Data Mining, pages 2096–2108,
work page 2023
-
[35]
The many shapley values for model expla- nation
[Sundararajan and Najmi, 2020] Mukund Sundararajan and Amir Najmi. The many shapley values for model expla- nation. InInternational conference on machine learning, pages 9269–9278. PMLR,
work page 2020
-
[36]
[Tastanet al., 2024 ] Nurbek Tastan, Samar Fares, Toluwani Aremu, Samuel Horvath, and Karthik Nandakumar. Re- defining contributions: Shapley-driven federated learning. arXiv preprint arXiv:2406.00569,
-
[37]
Measure contribution of participants in feder- ated learning
[Wanget al., 2019 ] Guan Wang, Charlie Xiaoqian Dang, and Ziye Zhou. Measure contribution of participants in feder- ated learning. In2019 IEEE international conference on big data (Big Data), pages 2597–2604. IEEE,
work page 2019
-
[38]
Shapley q-value: A local reward approach to solve global reward games
[Wanget al., 2020 ] Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: A local reward approach to solve global reward games. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7285–7292,
work page 2020
-
[39]
Fairness of exposure in stochastic ban- dits
[Wanget al., 2021 ] Lequn Wang, Yiwei Bai, Wen Sun, and Thorsten Joachims. Fairness of exposure in stochastic ban- dits. InInternational Conference on Machine Learning, pages 10686–10696. PMLR,
work page 2021
-
[40]
[Willson, 1993] Stephen J Willson. A value for partially de- fined cooperative games.International Journal of Game Theory, 21(4):371–384,
work page 1993
-
[41]
[Xiaet al., 2020 ] Wenchao Xia, Tony QS Quek, Kun Guo, Wanli Wen, Howard H Yang, and Hongbo Zhu. Multi- armed bandit-based client scheduling for federated learn- ing.IEEE Transactions on Wireless Communications, 19(11):7108–7123,
work page 2020
-
[42]
Combinatorial multi-armed bandits with con- cave rewards and fairness constraints
[Xuet al., 2020 ] Huanle Xu, Yang Liu, Wing Cheong Lau, and Rui Li. Combinatorial multi-armed bandits with con- cave rewards and fairness constraints. InIJCAI, pages 2554–2560,
work page 2020
-
[43]
[Yanget al., 2022 ] Xun Yang, Weijie Tan, Changgen Peng, Shuwen Xiang, and Kun Niu. Federated learning in- centive mechanism design via enhanced shapley value method.Wireless Communications and Mobile Computing, 2022(1):9690657,
work page 2022
-
[44]
Thus, with probability at least1−(δ 1 +δ 2), the true Shapley value of arma j lies within the following confidence interval ϕaj ∈ " ˆϕt,aj − r 1 2Nt,aj R ln 2M δ1 + 2 r 1 2L ln 4Nt,aj RM δ2 , ˆϕt,aj + r 1 2Nt,aj R ln 2M δ1 + 2 r 1 2L ln 4Nt,aj RM δ2 # Lemma 2.For anyδ∈(0,1), with probability at least1−δ/2, it holds that TX t=1 Ea∼πt q 1 Nt,a − TX t=1 X a∈...
work page 2024
-
[45]
The fairness regret can be upper bounded as follows: FRT = ⌈M/K⌉X t=1 MX a=1 πt(a)−π ∗(a) + TX t=⌈M/K⌉+1 MX a=1 πt(a)−π ∗(a) FRT ≤ M K + 1)K ! + 2KP a ϕa TX t=⌈M/K⌉+1 Eat∼πt "s 1 2Nt,aR ln 2M δ1 + 2 s 1 2L ln 4Nt,aRM δ2 # In our setting, the algorithm proceeds in discrete rounds, and within each round, it executes multiple timesteps corresponding to Monte...
work page 2020
-
[46]
Real-World Dataset for Social Influence Maximization: For the influence maximization experiments, we used a subset of the Facebook network introduced by [Leskovec and Mcauley, 2012]. The same dataset has also been used by [Nieet al., 2022; Fouratiet al., 2024 ] in their work, which further demonstrates its suitability for studying influence spread in soci...
work page 2012
-
[47]
The results in Figures 3, 4, and 5 show that increasing R and L improves the stability of the estimated K-Shapley values, and hence meritocratic fairness, but at the cost of higher computational overhead. The configurationR= 500 , L= 200 provides the best trade-off between performance and efficiency, yielding stable convergence and fairness consistency ac...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.