Near-Optimal Privacy-Preserving Learning for Max-Min Fair Multi-Agent Bandits

Amir Leshem

arxiv: 2306.04498 · v3 · submitted 2023-06-07 · 💻 cs.LG · cs.CY· cs.DC

Near-Optimal Privacy-Preserving Learning for Max-Min Fair Multi-Agent Bandits

Amir Leshem This is my paper

Pith reviewed 2026-05-24 08:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CYcs.DC

keywords multi-agent banditsmax-min fairnessprivacy-preserving learningdistributed algorithmscollision modelregret boundsbandit exploration

0 comments

The pith

A fully distributed algorithm learns max-min fair allocations in multi-agent bandits with O(N^3 f(log T) log T) regret while keeping all reward data private to each agent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a protocol for multiple agents to coordinate on a max-min fair arm allocation in a bandit setting where they can only communicate through collisions and must not share their reward observations. This is achieved by combining local exploration with distributed mechanisms for ordering agents and testing fairness thresholds via auctions. A sympathetic reader would care because it reduces the dependence on the number of agents from exponential to cubic while maintaining privacy, making large-scale fair learning feasible without a central collector of data. The regret bound shows near-logarithmic scaling with time horizon under mild conditions on the function f.

Core claim

We propose a fully distributed algorithm for bounded rewards with unknown support, achieving regret O(N^3 f(log T)log T), where f is any nondecreasing diverging function satisfying f(k-1)/f(k)→1. The algorithm combines distributed agent ordering, cumulative round-robin exploration, endpoint-revalidated warm-started bisection, and a collision-based distributed auction for threshold-feasibility tests. No agent collects the reward observations, empirical estimates, or preferences of the others, thus preserving reward privacy while coordinating only through collision outcomes.

What carries the argument

Endpoint-revalidated warm-started bisection combined with collision-based distributed auction for threshold-feasibility tests.

If this is right

The regret scales polynomially with the number of agents N rather than exponentially as in prior privacy-preserving methods.
All reward samples and empirical estimates remain local to each agent with no sharing.
The protocol applies to bounded rewards with unknown support.
Coordination occurs solely through collision outcomes without explicit communication or a leader.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This coordination style might extend to other distributed decision problems that rely on limited feedback like collisions.
One could check whether alternative ordering rules reduce the cubic dependence on N.
Real deployments would need to verify if collision detection noise affects the stability of the bisection step.

Load-bearing premise

The concentration of cumulative empirical estimates and the stability of endpoint-revalidated bisection must hold under the collision-only model with bounded but unknown-support rewards.

What would settle it

A simulation or analysis showing that the bisection search fails to identify the correct max-min threshold for some bounded reward distributions, or that the observed regret grows faster than N^3 log T with increasing number of agents.

Figures

Figures reproduced from arXiv: 2306.04498 by Amir Leshem.

**Figure 2.** Figure 2: Evolution of the regret vs. time. N, as well as the polynomial dependence on the number of agents [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗

**Figure 3.** Figure 3: Evolution of the regret vs. number of agents. [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

read the original abstract

We study fair multi-agent multi-armed bandit learning under collision-only coordination. Agents cannot communicate explicitly during learning and observe only their own rewards and whether collisions occur when several agents access the same arm. The goal is to learn a max-min fair allocation while keeping each agent's reward samples and empirical reward estimates local. We propose a fully distributed algorithm for bounded rewards with unknown support, achieving regret $O\!\left(N^3 f(\log T)\log T\right)$, where $f$ is any nondecreasing diverging function satisfying $f(k-1)/f(k)\to 1$. The algorithm combines distributed agent ordering, cumulative round-robin exploration, endpoint-revalidated warm-started bisection, and a collision-based distributed auction for threshold-feasibility tests. Unlike leader-based optimal algorithms, no agent collects the reward observations, empirical estimates, or preferences of the others. Thus, the protocol preserves reward privacy in the operational sense of avoiding reward sharing, while coordinating only through collision outcomes. Compared with previous privacy-preserving algorithms for max--min fair bandits, which have exponential dependence on the number of agents, our method achieves polynomial $N^3$ dependence while retaining near-logarithmic dependence on $T$. The analysis uses concentration of cumulative empirical estimates and stability of endpoint-revalidated bisection. Simulations confirm the predicted scaling with horizon, number of agents, and max--min gap across representative numerical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New distributed algorithm improves N-scaling for privacy-preserving max-min fair bandits but the key concentration and bisection claims rest on unshown steps under collisions and unknown support.

read the letter

The paper's main contribution is a fully distributed algorithm that achieves max-min fair allocation in multi-agent bandits while keeping all reward samples and estimates local to each agent. It reaches O(N^3 f(log T) log T) regret using distributed ordering, cumulative round-robin exploration, endpoint-revalidated bisection, and a collision-based auction. This is a clear step up from earlier privacy-preserving methods that scaled exponentially in N, and the operational privacy guarantee (no sharing of observations or preferences) is stated cleanly. The combination of primitives is new enough to be worth noting if the analysis holds. Simulations are mentioned as confirming the scaling, which is better than nothing. The soft spot is the analysis. The bound depends on concentration of cumulative estimates despite collisions (which create dependence and missing samples) and on stability of the warm-started bisection when the reward support is unknown. The abstract names these ingredients but gives no derivation or error-bar details, so it is not possible to check whether the deviation probabilities survive the collision model at the claimed rate. If those two properties do not hold as assumed, the polynomial-N claim does not follow. This work is aimed at researchers in decentralized bandits and privacy-constrained multi-agent learning. A reader who already works on collision models or fair allocation might extract the algorithmic ideas even if the proofs need tightening. It deserves a serious referee because the scaling improvement is concrete and the privacy model is operational rather than information-theoretic, but the review should focus on the concentration argument and the handling of unknown bounds.

Referee Report

2 major / 2 minor

Summary. The paper proposes a fully distributed algorithm for max-min fair multi-agent multi-armed bandit learning under collision-only coordination (no explicit communication, agents observe only own rewards and collisions). For bounded rewards with unknown support, it achieves regret O(N³ f(log T) log T) where f is any nondecreasing diverging function with f(k-1)/f(k)→1. The algorithm uses distributed agent ordering, cumulative round-robin exploration, endpoint-revalidated warm-started bisection, and a collision-based distributed auction; no agent shares reward observations, empirical estimates, or preferences. The analysis relies on concentration of cumulative empirical estimates and stability of endpoint-revalidated bisection. Simulations are reported to confirm scaling with T, N, and max-min gap.

Significance. If the stated regret bound holds with the claimed polynomial-N and near-log-T dependence while preserving operational privacy (no reward sharing), the result would improve substantially on prior privacy-preserving max-min fair bandit algorithms that exhibit exponential N dependence. The combination of fully distributed coordination via collisions only and the specific technical components (cumulative exploration + warm-started bisection + auction) would constitute a meaningful technical contribution to scalable fair multi-agent learning.

major comments (2)

[Abstract] Abstract: The central O(N³ f(log T) log T) regret claim rests on concentration of cumulative empirical estimates holding under the collision-only model (which induces dependence and missing observations) with unknown reward support. The abstract states that the analysis uses this concentration but provides no indication of how the unknown bounds enter the deviation probabilities or how collision-induced sample loss is bounded in the regret decomposition; this is load-bearing for both the polynomial-N and near-log-T terms.
[Abstract] Abstract: The stability of endpoint-revalidated warm-started bisection for the distributed auction threshold-feasibility tests is invoked to support the regret bound, yet the abstract does not detail how stability is maintained when the reward support is unknown (i.e., no a-priori bounds for revalidation). If this stability fails to hold at the required rate, the auction component cannot guarantee the claimed coordination without reward sharing.

minor comments (2)

The definition of f (nondecreasing, diverging, f(k-1)/f(k)→1) is given but no concrete examples (e.g., iterated logarithms) are supplied to illustrate the class of functions for which the bound applies.
The abstract mentions simulations confirming scaling but does not specify the number of runs, error-bar construction, or data-exclusion rules used to generate the reported plots.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying points where the abstract could better signal the technical handling of key components. We address each major comment below and will revise the abstract accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central O(N³ f(log T) log T) regret claim rests on concentration of cumulative empirical estimates holding under the collision-only model (which induces dependence and missing observations) with unknown reward support. The abstract states that the analysis uses this concentration but provides no indication of how the unknown bounds enter the deviation probabilities or how collision-induced sample loss is bounded in the regret decomposition; this is load-bearing for both the polynomial-N and near-log-T terms.

Authors: We agree that the abstract would benefit from a concise indication of these elements. In the revised manuscript we will add a short clause noting that the concentration analysis bounds the effective number of collision-free samples per agent-arm pair (via a union bound over collision patterns) and handles unknown support through an initial range-estimation phase whose cost is absorbed into the f(log T) factor. The full decomposition appears in the regret analysis sections; the abstract revision will make this explicit without altering the claimed bound. revision: yes
Referee: [Abstract] Abstract: The stability of endpoint-revalidated warm-started bisection for the distributed auction threshold-feasibility tests is invoked to support the regret bound, yet the abstract does not detail how stability is maintained when the reward support is unknown (i.e., no a-priori bounds for revalidation). If this stability fails to hold at the required rate, the auction component cannot guarantee the claimed coordination without reward sharing.

Authors: We acknowledge the abstract does not currently address stability under unknown support. We will revise the abstract to state that endpoint revalidation employs adaptively estimated bounds derived from observed samples, with stability shown via a high-probability argument that the revalidation error remains controlled by the same concentration tools used elsewhere. The detailed stability proof is contained in the auction analysis; the abstract change will reference this mechanism at a high level. revision: yes

Circularity Check

0 steps flagged

No circularity: regret bound derived from algorithm components and stated concentration properties

full rationale

The provided abstract and description present the O(N^3 f(log T) log T) regret as following from the proposed distributed algorithm (agent ordering, round-robin exploration, endpoint-revalidated bisection, collision-based auction) together with two technical properties: concentration of cumulative empirical estimates and stability of endpoint-revalidated bisection. These properties are invoked as analysis assumptions rather than being defined in terms of the target bound or fitted to data. No equations or steps are shown that reduce the claimed regret to a self-referential quantity, a fitted parameter renamed as prediction, or a self-citation chain. The derivation is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard concentration inequalities for empirical averages and on the stability property of the endpoint-revalidated bisection procedure under the collision model; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)

standard math Concentration inequalities apply to the cumulative empirical estimates built by round-robin exploration
Invoked for the analysis of local reward estimates
domain assumption Endpoint-revalidated bisection remains stable for threshold-feasibility tests
Central to the distributed auction and fairness threshold search

pith-pipeline@v0.9.0 · 5782 in / 1409 out tokens · 22989 ms · 2026-05-24T08:39:24.651986+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 3 internal anchors

[1]

Distributed multi-player bandits-a game of thrones approach,

I. Bistritz and A. Leshem, “Distributed multi-player bandits-a game of thrones approach,” in Advances in Neural Information Processing Systems , pp. 7222–7232, 2018

work page 2018
[2]

Distributed learning in multi-armed bandit with multiple players,

K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing , vol. 58, no. 11, pp. 5667–5681, 2010

work page 2010
[3]

Distributed multi-agent online learning based on global feedback,

J. Xu, C. Tekin, S. Zhang, and M. Van Der Schaar, “Distributed multi-agent online learning based on global feedback,” IEEE Transactions on Signal Processing, vol. 63, no. 9, pp. 2225–2238, 2015

work page 2015
[4]

Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,

S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013

work page 2013
[5]

Medium access in cognitive radio networks: A competitive multi-armed bandit framework,

L. Lai, H. Jiang, and H. V. Poor, “Medium access in cognitive radio networks: A competitive multi-armed bandit framework,” in Signals, Systems and Computers, 2008 42nd Asilomar Conference on , pp. 98–102, 2008

work page 2008
[6]

Distributed algorithms for learning and cognitive medium access with logarithmic regret,

A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” 16 IEEE Journal on Selected Areas in Communications , vol. 29, no. 4, pp. 731– 745, 2011

work page 2011
[7]

Competing bandits in matching markets,

L. T. Liu, H. Mania, and M. I. Jordan, “Competing bandits in matching markets,” arXiv preprint arXiv:1906.05363 , 2019

work page arXiv 1906
[8]

Learning in a changing world: Restless multi- armed bandit with unknown dynamics,

H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multi- armed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, 2013

work page 1902
[9]

Concurrent bandits and cognitive radio net- works,

O. Avner and S. Mannor, “Concurrent bandits and cognitive radio net- works,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 66–81, 2014

work page 2014
[10]

On regret-optimal learning in decen- tralized multiplayer multiarmed bandits,

N. Nayyar, D. Kalathil, and R. Jain, “On regret-optimal learning in decen- tralized multiplayer multiarmed bandits,” IEEE Trans. Control Netw. Syst. , vol. 5, no. 1, pp. 597–606, 2016

work page 2016
[11]

The Effect of Communication on Noncooperative Multiplayer Multi-Armed Bandit Problems

N. Evirgen and A. Kose, “The effect of communication on noncoop- erative multiplayer multi-armed bandit problems,” in arXiv preprint arXiv:1711.01628, 2017, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[12]

Learning with bandit feedback in potential games,

J. Cohen, A. H´ eliou, and P. Mertikopoulos, “Learning with bandit feedback in potential games,” in Proceedings of the 31th International Conference on Neural Information Processing Systems, 2017

work page 2017
[13]

Multi-user lax communications: a multi-armed bandit approach,

O. Avner and S. Mannor, “Multi-user lax communications: a multi-armed bandit approach,” in INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, IEEE , pp. 1–9, 2016

work page 2016
[14]

Distributed learning for channel allocation over a shared spectrum,

S. Zafaruddin, I. Bistritz, A. Leshem, and D. Niyato, “Distributed learning for channel allocation over a shared spectrum,”IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2337–2349, 2019

work page 2019
[15]

Multi-Player Bandits: A Trekking Approach

M. K. Hanawal and S. J. Darak, “Multi-player bandits: A trekking approach,” arXiv preprint arXiv:1809.06040 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Cooperative multi-player bandit optimization,

I. Bistritz and N. Bambos, “Cooperative multi-player bandit optimization,” Advances in Neural Information Processing Systems , vol. 33, 2020

work page 2020
[17]

Multi-player bandits–a musical chairs approach,

J. Rosenski, O. Shamir, and L. Szlak, “Multi-player bandits–a musical chairs approach,” in International Conference on Machine Learning , pp. 155–163, 2016

work page 2016
[18]

Sic-mmab: Synchronisation involves commu- nication in multiplayer multi-armed bandits,

E. Boursier and V. Perchet, “Sic-mmab: Synchronisation involves commu- nication in multiplayer multi-armed bandits,” in Adv Neural Inf Process Syst., pp. 12048–12057, 2019

work page 2019
[19]

Multi-player bandits: The adversarial case,

P. Alatur, K. Y. Levy, and A. Krause, “Multi-player bandits: The adversarial case,” J. Mach. Learn. Res. , vol. 21, no. 77, pp. 1–23, 2020. 17

work page 2020
[20]

Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without

S. Bubeck, Y. Li, Y. Peres, and M. Sellke, “Non-stochastic multi-player multi-armed bandits: Optimal rate with collision information, sublinear without,” arXiv preprint arXiv:1904.12233 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[21]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems,

S. Bubeck, N. Cesa-Bianchi, et al. , “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012

work page 2012
[22]

Fully distributed optimal channel assignment for open spectrum access,

O. Naparstek and A. Leshem, “Fully distributed optimal channel assignment for open spectrum access,” IEEE Trans. Signal Process. , vol. 62, no. 2, pp. 283–294, 2014

work page 2014
[23]

Opportunistic carrier sensing for energy-efficient information retrieval in sensor networks,

Q. Zhao and L. Tong, “Opportunistic carrier sensing for energy-efficient information retrieval in sensor networks,” EURASIP J. Wirel. Commun. Netw., vol. 2005, no. 2, pp. 1–11, 2005

work page 2005
[24]

Distributed learning for opti- mal spectrum access in dense device-to-device ad-hoc networks,

T. Boyarski, W. Wang, and A. Leshem, “Distributed learning for opti- mal spectrum access in dense device-to-device ad-hoc networks,” IEEE Transactions on Signal Processing, 2023

work page 2023
[25]

Multi-player bandits revisited,

L. Besson and E. Kaufmann, “Multi-player bandits revisited,” inAlgorithmic Learning Theory, pp. 56–92, 2018

work page 2018
[26]

Distributed learning and optimal assignment in multiplayer heterogeneous networks,

H. Tibrewal, S. Patchala, M. K. Hanawal, and S. J. Darak, “Distributed learning and optimal assignment in multiplayer heterogeneous networks,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications , pp. 1693–1701, IEEE, 2019

work page 2019
[27]

Game of thrones: Fully distributed learning for multiplayer bandits,

I. Bistritz and A. Leshem, “Game of thrones: Fully distributed learning for multiplayer bandits,” Mathematics of Operations Research, vol. 46, no. 1, pp. 159–178, 2021

work page 2021
[28]

Decentralized learning for multiplayer multiarmed bandits,

D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory , vol. 60, no. 4, pp. 2331–2345, 2014

work page 2014
[29]

A practical algorithm for multiplayer bandits when arm means vary among players,

A. Mehrabian, E. Boursier, E. Kaufmann, and V. Perchet, “A practical algorithm for multiplayer bandits when arm means vary among players,” in 23rd AISTATS, (online), pp. 1211–1221, PMLR, Aug. 2020

work page 2020
[30]

The bargaining problem,

J. F. Nash Jr, “The bargaining problem,” Econometrica: Journal of the econometric society, pp. 155–162, 1950

work page 1950
[31]

Kubiak, Proportional optimization and fairness , vol

W. Kubiak, Proportional optimization and fairness , vol. 127. Springer Science & Business Media, 2008

work page 2008
[32]

Other solutions to nash’s bargaining prob- lem,

E. Kalai and M. Smorodinsky, “Other solutions to nash’s bargaining prob- lem,” Econometrica: Journal of the Econometric Society , pp. 513–518, 1975. 18

work page 1975
[33]

Properties of pareto optimal allocations of resources to activities,

K. M. Mjelde, “Properties of pareto optimal allocations of resources to activities,” modeling, identification and control, vol. 4, no. 3, pp. 167–173, 1983

work page 1983
[34]

Weighted max-min resource allocation for frequency selective channels,

E. Zehavi, A. Leshem, R. Levanda, and Z. Han, “Weighted max-min resource allocation for frequency selective channels,” IEEE transactions on signal processing, vol. 61, no. 15, pp. 3723–3732, 2013

work page 2013
[35]

A unified framework for max-min and min-max fairness with applications,

B. Radunovic and J.-Y. Le Boudec, “A unified framework for max-min and min-max fairness with applications,” IEEE/ACM Transactions on networking, vol. 15, no. 5, pp. 1073–1083, 2007

work page 2007
[36]

An approximation algorithm for max-min fair allocation of indivisible goods,

A. Asadpour and A. Saberi, “An approximation algorithm for max-min fair allocation of indivisible goods,” SIAM Journal on Computing , vol. 39, no. 7, pp. 2970–2989, 2010

work page 2010
[37]

My fair bandit: Distributed learning of max-min fairness with multi-player bandits,

I. Bistritz, T. Baharav, A. Leshem, and N. Bambos, “My fair bandit: Distributed learning of max-min fairness with multi-player bandits,” in International Conference on Machine Learning , pp. 930–940, PMLR, 2020

work page 2020
[38]

One for all and all for one: Distributed learning of fair allocations with multi-player bandits,

I. Bistritz, T. Z. Baharav, A. Leshem, and N. Bambos, “One for all and all for one: Distributed learning of fair allocations with multi-player bandits,” IEEE J. Sel. Areas Inf. Theory , vol. 2, no. 2, pp. 584–598, 2021

work page 2021
[39]

Multi-player multi-armed bandits for stable allocation in heterogeneous ad-hoc networks,

S. J. Darak and M. K. Hanawal, “Multi-player multi-armed bandits for stable allocation in heterogeneous ad-hoc networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2350–2363, 2019

work page 2019
[40]

Individual regret in cooperative nonstochas- tic multi-armed bandits,

Y. Bar-On and Y. Mansour, “Individual regret in cooperative nonstochas- tic multi-armed bandits,” in Adv Neural Inf Process Syst. (H. Wallach, H. Larochelle, A. Beygelzimer, F. d 'Alch´ e-Buc, E. Fox, and R. Garnett, eds.), vol. 32, pp. 3116–3126, Curran Associates, Inc., 2019

work page 2019
[41]

Fairness in reinforcement learning,

S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth, “Fairness in reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1617–1626, JMLR. org, 2017

work page 2017
[42]

Fairness in learn- ing: Classic and contextual bandits,

M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth, “Fairness in learn- ing: Classic and contextual bandits,” in Advances in Neural Information Processing Systems, pp. 325–333, 2016

work page 2016
[43]

Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness,

X. Zhang, M. Khaliligarekani, C. Tekin, et al., “Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness,” Advances in neural information processing systems , vol. 32, 2019

work page 2019
[44]

Fair end-to-end window-based congestion control,

J. Mo and J. Walrand, “Fair end-to-end window-based congestion control,” IEEE/ACM Transactions on networking , no. 5, pp. 556–567, 2000. 19

work page 2000
[45]

Asymptotically optimal allocation of treatments in sequential experiments,

T. L. Lai and H. Robbins, “Asymptotically optimal allocation of treatments in sequential experiments,” Design of Experiments: Ranking and Selection , pp. 127–142, 1984

work page 1984
[46]

The true sample complexity of identify- ing good arms,

J. Katz-Samuels and K. Jamieson, “The true sample complexity of identify- ing good arms,” in International Conference on Artificial Intelligence and Statistics, pp. 1781–1791, 2020

work page 2020
[47]

Rom and M

R. Rom and M. Sidi, Multiple access protocols: Performance and analysis . Springer Science & Business Media, 2012

work page 2012
[48]

A distributed algorithm for the assignment problem,

D. P. Bertsekas, “A distributed algorithm for the assignment problem,” Lab. for Information and Decision Systems Working Paper, MIT , 1979

work page 1979
[49]

Expected time complexity of the auction algorithm and the push relabel algorithm for maximum bipartite matching on random graphs,

O. Naparstek and A. Leshem, “Expected time complexity of the auction algorithm and the push relabel algorithm for maximum bipartite matching on random graphs,” Random Structures & Algorithms, vol. 48, no. 2, pp. 384– 395, 2016

work page 2016
[50]

An efficient cost scaling algorithm for the assignment problem,

A. V. Goldberg and R. Kennedy, “An efficient cost scaling algorithm for the assignment problem,” Mathematical Programming, vol. 71, no. 2, pp. 153– 177, 1995

work page 1995
[51]

A forward/reverse auction algorithm for asymmetric assignment problems,

D. P. Bertsekas and D. A. Castanon, “A forward/reverse auction algorithm for asymmetric assignment problems,” Computational Optimization and Applications, vol. 1, pp. 277–297, 1992

work page 1992
[52]

The auction algorithm: A distributed relaxation method for the assignment problem,

D. P. Bertsekas, “The auction algorithm: A distributed relaxation method for the assignment problem,” Annals of operations research, vol. 14, no. 1, pp. 105–123, 1988. A Proof of Lemma 4.2 First we compute the probability that our overestimate of the support of one of the arm fails. Let γ = maxm,nF (Bm,n/2) (20) ˆB(K) = maxm,n ˆBm,n(K). (21) By definition...

work page 1988

[1] [1]

Distributed multi-player bandits-a game of thrones approach,

I. Bistritz and A. Leshem, “Distributed multi-player bandits-a game of thrones approach,” in Advances in Neural Information Processing Systems , pp. 7222–7232, 2018

work page 2018

[2] [2]

Distributed learning in multi-armed bandit with multiple players,

K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing , vol. 58, no. 11, pp. 5667–5681, 2010

work page 2010

[3] [3]

Distributed multi-agent online learning based on global feedback,

J. Xu, C. Tekin, S. Zhang, and M. Van Der Schaar, “Distributed multi-agent online learning based on global feedback,” IEEE Transactions on Signal Processing, vol. 63, no. 9, pp. 2225–2238, 2015

work page 2015

[4] [4]

Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,

S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013

work page 2013

[5] [5]

Medium access in cognitive radio networks: A competitive multi-armed bandit framework,

L. Lai, H. Jiang, and H. V. Poor, “Medium access in cognitive radio networks: A competitive multi-armed bandit framework,” in Signals, Systems and Computers, 2008 42nd Asilomar Conference on , pp. 98–102, 2008

work page 2008

[6] [6]

Distributed algorithms for learning and cognitive medium access with logarithmic regret,

A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” 16 IEEE Journal on Selected Areas in Communications , vol. 29, no. 4, pp. 731– 745, 2011

work page 2011

[7] [7]

Competing bandits in matching markets,

L. T. Liu, H. Mania, and M. I. Jordan, “Competing bandits in matching markets,” arXiv preprint arXiv:1906.05363 , 2019

work page arXiv 1906

[8] [8]

Learning in a changing world: Restless multi- armed bandit with unknown dynamics,

H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multi- armed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, 2013

work page 1902

[9] [9]

Concurrent bandits and cognitive radio net- works,

O. Avner and S. Mannor, “Concurrent bandits and cognitive radio net- works,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 66–81, 2014

work page 2014

[10] [10]

On regret-optimal learning in decen- tralized multiplayer multiarmed bandits,

N. Nayyar, D. Kalathil, and R. Jain, “On regret-optimal learning in decen- tralized multiplayer multiarmed bandits,” IEEE Trans. Control Netw. Syst. , vol. 5, no. 1, pp. 597–606, 2016

work page 2016

[11] [11]

The Effect of Communication on Noncooperative Multiplayer Multi-Armed Bandit Problems

N. Evirgen and A. Kose, “The effect of communication on noncoop- erative multiplayer multi-armed bandit problems,” in arXiv preprint arXiv:1711.01628, 2017, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[12] [12]

Learning with bandit feedback in potential games,

J. Cohen, A. H´ eliou, and P. Mertikopoulos, “Learning with bandit feedback in potential games,” in Proceedings of the 31th International Conference on Neural Information Processing Systems, 2017

work page 2017

[13] [13]

Multi-user lax communications: a multi-armed bandit approach,

O. Avner and S. Mannor, “Multi-user lax communications: a multi-armed bandit approach,” in INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, IEEE , pp. 1–9, 2016

work page 2016

[14] [14]

Distributed learning for channel allocation over a shared spectrum,

S. Zafaruddin, I. Bistritz, A. Leshem, and D. Niyato, “Distributed learning for channel allocation over a shared spectrum,”IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2337–2349, 2019

work page 2019

[15] [15]

Multi-Player Bandits: A Trekking Approach

M. K. Hanawal and S. J. Darak, “Multi-player bandits: A trekking approach,” arXiv preprint arXiv:1809.06040 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Cooperative multi-player bandit optimization,

I. Bistritz and N. Bambos, “Cooperative multi-player bandit optimization,” Advances in Neural Information Processing Systems , vol. 33, 2020

work page 2020

[17] [17]

Multi-player bandits–a musical chairs approach,

J. Rosenski, O. Shamir, and L. Szlak, “Multi-player bandits–a musical chairs approach,” in International Conference on Machine Learning , pp. 155–163, 2016

work page 2016

[18] [18]

Sic-mmab: Synchronisation involves commu- nication in multiplayer multi-armed bandits,

E. Boursier and V. Perchet, “Sic-mmab: Synchronisation involves commu- nication in multiplayer multi-armed bandits,” in Adv Neural Inf Process Syst., pp. 12048–12057, 2019

work page 2019

[19] [19]

Multi-player bandits: The adversarial case,

P. Alatur, K. Y. Levy, and A. Krause, “Multi-player bandits: The adversarial case,” J. Mach. Learn. Res. , vol. 21, no. 77, pp. 1–23, 2020. 17

work page 2020

[20] [20]

Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without

S. Bubeck, Y. Li, Y. Peres, and M. Sellke, “Non-stochastic multi-player multi-armed bandits: Optimal rate with collision information, sublinear without,” arXiv preprint arXiv:1904.12233 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[21] [21]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems,

S. Bubeck, N. Cesa-Bianchi, et al. , “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012

work page 2012

[22] [22]

Fully distributed optimal channel assignment for open spectrum access,

O. Naparstek and A. Leshem, “Fully distributed optimal channel assignment for open spectrum access,” IEEE Trans. Signal Process. , vol. 62, no. 2, pp. 283–294, 2014

work page 2014

[23] [23]

Opportunistic carrier sensing for energy-efficient information retrieval in sensor networks,

Q. Zhao and L. Tong, “Opportunistic carrier sensing for energy-efficient information retrieval in sensor networks,” EURASIP J. Wirel. Commun. Netw., vol. 2005, no. 2, pp. 1–11, 2005

work page 2005

[24] [24]

Distributed learning for opti- mal spectrum access in dense device-to-device ad-hoc networks,

T. Boyarski, W. Wang, and A. Leshem, “Distributed learning for opti- mal spectrum access in dense device-to-device ad-hoc networks,” IEEE Transactions on Signal Processing, 2023

work page 2023

[25] [25]

Multi-player bandits revisited,

L. Besson and E. Kaufmann, “Multi-player bandits revisited,” inAlgorithmic Learning Theory, pp. 56–92, 2018

work page 2018

[26] [26]

Distributed learning and optimal assignment in multiplayer heterogeneous networks,

H. Tibrewal, S. Patchala, M. K. Hanawal, and S. J. Darak, “Distributed learning and optimal assignment in multiplayer heterogeneous networks,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications , pp. 1693–1701, IEEE, 2019

work page 2019

[27] [27]

Game of thrones: Fully distributed learning for multiplayer bandits,

I. Bistritz and A. Leshem, “Game of thrones: Fully distributed learning for multiplayer bandits,” Mathematics of Operations Research, vol. 46, no. 1, pp. 159–178, 2021

work page 2021

[28] [28]

Decentralized learning for multiplayer multiarmed bandits,

D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory , vol. 60, no. 4, pp. 2331–2345, 2014

work page 2014

[29] [29]

A practical algorithm for multiplayer bandits when arm means vary among players,

A. Mehrabian, E. Boursier, E. Kaufmann, and V. Perchet, “A practical algorithm for multiplayer bandits when arm means vary among players,” in 23rd AISTATS, (online), pp. 1211–1221, PMLR, Aug. 2020

work page 2020

[30] [30]

The bargaining problem,

J. F. Nash Jr, “The bargaining problem,” Econometrica: Journal of the econometric society, pp. 155–162, 1950

work page 1950

[31] [31]

Kubiak, Proportional optimization and fairness , vol

W. Kubiak, Proportional optimization and fairness , vol. 127. Springer Science & Business Media, 2008

work page 2008

[32] [32]

Other solutions to nash’s bargaining prob- lem,

E. Kalai and M. Smorodinsky, “Other solutions to nash’s bargaining prob- lem,” Econometrica: Journal of the Econometric Society , pp. 513–518, 1975. 18

work page 1975

[33] [33]

Properties of pareto optimal allocations of resources to activities,

K. M. Mjelde, “Properties of pareto optimal allocations of resources to activities,” modeling, identification and control, vol. 4, no. 3, pp. 167–173, 1983

work page 1983

[34] [34]

Weighted max-min resource allocation for frequency selective channels,

E. Zehavi, A. Leshem, R. Levanda, and Z. Han, “Weighted max-min resource allocation for frequency selective channels,” IEEE transactions on signal processing, vol. 61, no. 15, pp. 3723–3732, 2013

work page 2013

[35] [35]

A unified framework for max-min and min-max fairness with applications,

B. Radunovic and J.-Y. Le Boudec, “A unified framework for max-min and min-max fairness with applications,” IEEE/ACM Transactions on networking, vol. 15, no. 5, pp. 1073–1083, 2007

work page 2007

[36] [36]

An approximation algorithm for max-min fair allocation of indivisible goods,

A. Asadpour and A. Saberi, “An approximation algorithm for max-min fair allocation of indivisible goods,” SIAM Journal on Computing , vol. 39, no. 7, pp. 2970–2989, 2010

work page 2010

[37] [37]

My fair bandit: Distributed learning of max-min fairness with multi-player bandits,

I. Bistritz, T. Baharav, A. Leshem, and N. Bambos, “My fair bandit: Distributed learning of max-min fairness with multi-player bandits,” in International Conference on Machine Learning , pp. 930–940, PMLR, 2020

work page 2020

[38] [38]

One for all and all for one: Distributed learning of fair allocations with multi-player bandits,

I. Bistritz, T. Z. Baharav, A. Leshem, and N. Bambos, “One for all and all for one: Distributed learning of fair allocations with multi-player bandits,” IEEE J. Sel. Areas Inf. Theory , vol. 2, no. 2, pp. 584–598, 2021

work page 2021

[39] [39]

Multi-player multi-armed bandits for stable allocation in heterogeneous ad-hoc networks,

S. J. Darak and M. K. Hanawal, “Multi-player multi-armed bandits for stable allocation in heterogeneous ad-hoc networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2350–2363, 2019

work page 2019

[40] [40]

Individual regret in cooperative nonstochas- tic multi-armed bandits,

Y. Bar-On and Y. Mansour, “Individual regret in cooperative nonstochas- tic multi-armed bandits,” in Adv Neural Inf Process Syst. (H. Wallach, H. Larochelle, A. Beygelzimer, F. d 'Alch´ e-Buc, E. Fox, and R. Garnett, eds.), vol. 32, pp. 3116–3126, Curran Associates, Inc., 2019

work page 2019

[41] [41]

Fairness in reinforcement learning,

S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth, “Fairness in reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1617–1626, JMLR. org, 2017

work page 2017

[42] [42]

Fairness in learn- ing: Classic and contextual bandits,

M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth, “Fairness in learn- ing: Classic and contextual bandits,” in Advances in Neural Information Processing Systems, pp. 325–333, 2016

work page 2016

[43] [43]

Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness,

X. Zhang, M. Khaliligarekani, C. Tekin, et al., “Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness,” Advances in neural information processing systems , vol. 32, 2019

work page 2019

[44] [44]

Fair end-to-end window-based congestion control,

J. Mo and J. Walrand, “Fair end-to-end window-based congestion control,” IEEE/ACM Transactions on networking , no. 5, pp. 556–567, 2000. 19

work page 2000

[45] [45]

Asymptotically optimal allocation of treatments in sequential experiments,

T. L. Lai and H. Robbins, “Asymptotically optimal allocation of treatments in sequential experiments,” Design of Experiments: Ranking and Selection , pp. 127–142, 1984

work page 1984

[46] [46]

The true sample complexity of identify- ing good arms,

J. Katz-Samuels and K. Jamieson, “The true sample complexity of identify- ing good arms,” in International Conference on Artificial Intelligence and Statistics, pp. 1781–1791, 2020

work page 2020

[47] [47]

Rom and M

R. Rom and M. Sidi, Multiple access protocols: Performance and analysis . Springer Science & Business Media, 2012

work page 2012

[48] [48]

A distributed algorithm for the assignment problem,

D. P. Bertsekas, “A distributed algorithm for the assignment problem,” Lab. for Information and Decision Systems Working Paper, MIT , 1979

work page 1979

[49] [49]

Expected time complexity of the auction algorithm and the push relabel algorithm for maximum bipartite matching on random graphs,

O. Naparstek and A. Leshem, “Expected time complexity of the auction algorithm and the push relabel algorithm for maximum bipartite matching on random graphs,” Random Structures & Algorithms, vol. 48, no. 2, pp. 384– 395, 2016

work page 2016

[50] [50]

An efficient cost scaling algorithm for the assignment problem,

A. V. Goldberg and R. Kennedy, “An efficient cost scaling algorithm for the assignment problem,” Mathematical Programming, vol. 71, no. 2, pp. 153– 177, 1995

work page 1995

[51] [51]

A forward/reverse auction algorithm for asymmetric assignment problems,

D. P. Bertsekas and D. A. Castanon, “A forward/reverse auction algorithm for asymmetric assignment problems,” Computational Optimization and Applications, vol. 1, pp. 277–297, 1992

work page 1992

[52] [52]

The auction algorithm: A distributed relaxation method for the assignment problem,

D. P. Bertsekas, “The auction algorithm: A distributed relaxation method for the assignment problem,” Annals of operations research, vol. 14, no. 1, pp. 105–123, 1988. A Proof of Lemma 4.2 First we compute the probability that our overestimate of the support of one of the arm fails. Let γ = maxm,nF (Bm,n/2) (20) ˆB(K) = maxm,n ˆBm,n(K). (21) By definition...

work page 1988