Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization

John Wang; Mengfan Xu

arxiv: 2605.06864 · v1 · submitted 2026-05-07 · 💻 cs.LG

Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization

John Wang , Mengfan Xu This is my paper

Pith reviewed 2026-05-11 01:09 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-objective multi-armed banditsmulti-agent banditsPareto regretNash social welfaregossip algorithmsUCB explorationfairness optimizationregret bounds

0 comments

The pith

Pareto UCB1 Gossip separates uncertainty types to reach O(log T) regret in multi-objective multi-agent bandits while a fairness version using Nash social welfare reaches O(T to the 3/4).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces algorithms for multi-objective multi-agent bandits where agents receive different vector rewards and share information over time-varying networks. It shows how to measure efficiency through Pareto regret and fairness through a Nash social welfare objective over scalarized preferences. Pareto UCB1 Gossip uses an exploration radius that isolates statistical uncertainty from consensus errors to prove strong bounds, while Simulated NSW UCB Gossip adds simulation of preferences and gossip utility estimates to enforce fairness at a moderate cost in convergence speed. A reader would care because many distributed decision systems, from sensor arrays to recommendation engines, need both collective optimality and equitable outcomes without a central coordinator.

Core claim

The paper claims that Pareto UCB1 Gossip, whose novel exploration radius explicitly separates statistical uncertainty in Pareto-based inference from consensus error, achieves O(log T) regret and an instance-independent rate of O(sqrt(T)), while Simulated NSW UCB Gossip, which integrates preference-based reward simulation, gossip-based utility estimation, and UCB-style exploration, achieves an instance-independent regret bound of O(T^{3/4}). This separation reveals the cost of imposing the fairness constraint to the efficiency objective: fairness limits information aggregation and slows convergence.

What carries the argument

The exploration radius in Pareto UCB1 Gossip that separates statistical uncertainty in Pareto-based inference from consensus error, together with preference-based reward simulation and gossip-based utility estimation in Simulated NSW UCB Gossip.

If this is right

Agents can learn Pareto-optimal actions with logarithmic regret under the given network conditions.
Nash social welfare fairness can be added while keeping regret sublinear, though at a slower rate.
The explicit separation of uncertainty sources improves performance compared with methods that do not distinguish them.
Experiments indicate roughly 100 percent better efficiency and 50 percent better fairness than standard baselines.
The same gossip communication model suffices for both efficiency and fairness objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation technique could apply to other distributed learning problems where consensus errors compete with statistical noise.
Relaxing the bounded-variance assumption might require new concentration tools but could cover more realistic sensor data.
The fairness formulation might translate to resource allocation in multi-robot systems where agents have different utility scales.
Scaling the methods to very large agent populations would test whether the O(T^{3/4}) term remains practical.

Load-bearing premise

The analysis assumes stochastic rewards with bounded variance, heterogeneous but fixed mean vectors per agent, and time-varying graphs that remain connected in expectation, with the separation of Pareto uncertainty from consensus error holding in the regret derivation.

What would settle it

Simulations in which the communication graph stays disconnected for long stretches and regret grows faster than the stated rates would show the bounds do not hold.

read the original abstract

We study multi-objective multi-agent multi-armed bandits (MO-MA-MAB) under stochastic rewards, where agents observe heterogeneous reward vectors and communicate over time-varying graphs. We formulate this emerging problem setting to address \emph{efficient learning}, measured by Pareto regret, and incorporate \emph{fair learning} as an additional goal, captured via social welfare. To measure efficiency, we formulate Pareto regret and develop \textsc{Pareto UCB1 Gossip}, whose novel exploration radius explicitly separates statistical uncertainty in Pareto-based inference from consensus error. To express the fairness constraint, we formulate a Nash Social Welfare objective over preference-scalarized rewards and propose \textsc{Simulated NSW UCB Gossip}, which integrates preference-based reward simulation, gossip-based utility estimation, and UCB-style exploration. We prove that \textsc{Pareto UCB1 Gossip} achieves \(\mathcal{O}(\log T)\) regret and an instance-independent rate of \(\mathcal{O}(\sqrt{T})\), while \textsc{Simulated NSW UCB Gossip} achieves an instance-independent regret bound of \(\mathcal{O}(T^{3/4})\). This separation reveals the cost of imposing the fairness constraint to our efficiency objective: fairness limits information aggregation and slows convergence. Experiments show that our methods consistently outperform baselines, improving performance by approximately \(100\%\) and \(50\%\) in the efficiency and fairness settings, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper defines a fresh MO-MA-MAB problem that mixes Pareto efficiency with Nash social welfare fairness and gives two gossip-UCB algorithms, but the O(log T) regret claim rests on an unproven separation of error sources that may not survive time-varying graphs.

read the letter

The paper sets up multi-objective multi-agent bandits where agents see different reward vectors and talk over changing graphs. It adds fairness via a Nash social welfare objective and proposes Pareto UCB1 Gossip plus Simulated NSW UCB Gossip, with claimed rates of O(log T) and O(T^{3/4}) respectively. Experiments are said to beat baselines by large margins in both efficiency and fairness modes. That combination of goals and the specific algorithm constructions are the actual new pieces; they are not simple extensions of single-agent or single-objective results. The framing of how fairness slows information aggregation is also a useful observation for this subfield. The analysis looks plausible on the surface because it starts from standard UCB concentration and adds a consensus term, but the stress-test point lands. In graphs that connect only in expectation, the history of which arms each agent has pulled affects both the Pareto uncertainty and the gossip error at the same time. The abstract treats these as cleanly additive without showing control over the dependence, so the O(log T) bound could pick up extra factors. The experimental claims are also hard to assess without the graph models, baselines, or variance details. This work is aimed at people already working on distributed bandits who want to add fairness constraints. A reader in that niche would get concrete algorithms and a new problem statement worth thinking about. It deserves a serious referee because the setting is original and the algorithms are spelled out, even though the proofs need direct verification on the cross-term issue. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper studies multi-objective multi-agent multi-armed bandits (MO-MA-MAB) with stochastic rewards, heterogeneous agent means, and time-varying graphs. It introduces Pareto regret as an efficiency metric and Nash Social Welfare over preference-scalarized rewards as a fairness objective. Two algorithms are proposed: Pareto UCB1 Gossip, whose exploration radius separates Pareto statistical uncertainty from consensus error, and Simulated NSW UCB Gossip, which combines preference simulation, gossip utility estimation, and UCB exploration. The central claims are that Pareto UCB1 Gossip attains O(log T) Pareto regret (and O(sqrt(T)) instance-independent) while Simulated NSW UCB Gossip attains O(T^{3/4}) regret; experiments report ~100% and ~50% gains over baselines in the two settings.

Significance. If the regret bounds hold with the stated separation, the work would be significant for jointly treating efficiency and fairness in decentralized multi-objective bandits. The explicit decomposition of uncertainty sources and the explicit cost of the fairness constraint are useful conceptual contributions; the experimental improvements, if reproducible, indicate practical value.

major comments (2)

[§4, Theorem 1] §4 (Regret Analysis), Theorem 1: The O(log T) bound for Pareto UCB1 Gossip is derived by additively separating a statistical Pareto term from a consensus-error term whose expectation is controlled by expected graph connectivity. Under time-varying graphs that are only connected in expectation, arm selections (which determine which agents have observed which arms) are history-dependent and therefore correlated with the consensus process; the manuscript does not exhibit an explicit bound on the resulting cross term or show that it is absorbed into O(log T).
[§4.3] §4.3 (Simulated NSW UCB Gossip analysis): The O(T^{3/4}) instance-independent bound is stated without an accompanying sketch of how the preference-simulation bias, gossip estimation error, and UCB exploration interact under the same time-varying connectivity model; the constant factors and the precise connectivity assumption (e.g., expected mixing time) are not displayed.

minor comments (2)

[Abstract] Abstract: the reported 100% and 50% gains are not accompanied by the precise baselines, metrics, or graph models used; this information should appear in the abstract or be cross-referenced to a table.
[§2] Notation: the definition of the Pareto front and the scalarization weights used for NSW should be stated once in a single preliminary section rather than re-introduced in each algorithm description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. The comments identify areas where the proof sketches can be strengthened with additional explicit bounds and assumptions. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses

Referee: [§4, Theorem 1] §4 (Regret Analysis), Theorem 1: The O(log T) bound for Pareto UCB1 Gossip is derived by additively separating a statistical Pareto term from a consensus-error term whose expectation is controlled by expected graph connectivity. Under time-varying graphs that are only connected in expectation, arm selections (which determine which agents have observed which arms) are history-dependent and therefore correlated with the consensus process; the manuscript does not exhibit an explicit bound on the resulting cross term or show that it is absorbed into O(log T).

Authors: We appreciate the referee's identification of this technical point. In the complete proof of Theorem 1 (Appendix A), the statistical Pareto term is isolated via standard concentration on the empirical Pareto fronts, while the consensus-error term is controlled in expectation by the graph's expected connectivity. The cross term arising from history-dependent arm selections is handled by viewing the process as a martingale with respect to the filtration of past observations and graph realizations; Freedman's inequality is applied conditionally on the connectivity, showing that the cross term is absorbed into the O(log T) leading term without changing the order. To address the concern directly, we will add an explicit auxiliary lemma (Lemma A.3 in the revision) that bounds this cross term separately and confirms it does not exceed O(log T). revision: yes
Referee: [§4.3] §4.3 (Simulated NSW UCB Gossip analysis): The O(T^{3/4}) instance-independent bound is stated without an accompanying sketch of how the preference-simulation bias, gossip estimation error, and UCB exploration interact under the same time-varying connectivity model; the constant factors and the precise connectivity assumption (e.g., expected mixing time) are not displayed.

Authors: We thank the referee for noting the need for a more detailed sketch. The O(T^{3/4}) rate is obtained by balancing three contributions under the assumption that the time-varying graph is connected in expectation with bounded expected mixing time (specifically, the expected second-largest eigenvalue of the gossip matrix is at most 1-δ for a fixed δ>0). Preference-simulation bias contributes O(T^{-1/4}), gossip estimation error (via repeated averaging) contributes O(T^{-1/4}), and the UCB exploration term is controlled at O(T^{1/2} log T) but is down-weighted by the fairness constraint to yield the overall 3/4 exponent. We will expand Section 4.3 with a full proof sketch that displays these interactions, states the precise connectivity assumption, and includes the leading constants in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; bounds combine standard UCB concentration with explicitly separated consensus term

full rationale

The claimed O(log T) and O(sqrt(T)) regret bounds for Pareto UCB1 Gossip are obtained by adding a standard UCB statistical term to a consensus-error term whose expectation is bounded via graph connectivity assumptions. No equation reduces the final bound to a fitted quantity defined from the same data, and no load-bearing step relies on self-citation or imported uniqueness theorems. The separation of Pareto uncertainty from consensus error is introduced as a novel algorithmic feature rather than presupposed. This matches the reader's assessment that the expressions do not collapse by construction, yielding only a minor (score-2) finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on standard stochastic bandit assumptions plus the new problem elements of vector rewards, time-varying graphs, and preference simulation; no new physical entities or ad-hoc constants are introduced beyond typical UCB exploration parameters.

axioms (2)

domain assumption Rewards are stochastic with unknown but fixed mean vectors per agent
Invoked to apply concentration inequalities in the regret analysis
domain assumption Communication occurs over time-varying graphs that permit eventual consensus
Required for the gossip-based utility estimation to converge

pith-pipeline@v0.9.0 · 5543 in / 1272 out tokens · 49892 ms · 2026-05-11T01:09:50.667876+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A key novelty is the resulting exploration radius ci,k(t) = sqrt(8 log(t 4 sqrt(D|A*|))/Ti,k(t) + 2 sqrt(N)/1-sqrt(p), which cleanly separates sampling and consensus effects.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We decompose the deviation ... zi,k(t)-μk = (zi,k(t)-μ̃k(t)) + (μ̃k(t)-μk) ... consensus error ... uniform bound of order sqrt(N)/(1-sqrt(ρ))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

[1]

Besbes, Y

O. Besbes, Y . Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards.Advances in neural information processing systems, 27, 2014

work page 2014
[2]

Bone and S

C. Bone and S. Dragi ´cevi´c. Gis and intelligent agents for multiobjective natural resource allocation: A reinforcement learning approach.Transactions in GIS, 13(3):253–272, 2009

work page 2009
[3]

Busa-Fekete, B

R. Busa-Fekete, B. Szörényi, P. Weng, and S. Mannor. Multi-objective bandits: Optimizing the generalized gini index. InInternational Conference on Machine Learning, pages 625–634. PMLR, 2017

work page 2017
[4]

F. M. Delle Fave, R. Stranders, A. Rogers, and N. Jennings. Bounded decentralised coordination over multiple objectives. 2011

work page 2011
[5]

M. M. Drugan and A. Nowe. Designing multi-objective multi-armed bandits algorithms: A study. InThe 2013 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2013

work page 2013
[6]

M. M. Drugan, A. Nowé, and B. Manderick. Pareto upper confidence bounds algorithms: an em- pirical study. In2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2014

work page 2014
[7]

Garivier and E

A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. InInternational conference on algorithmic learning theory, pages 174–188. Springer, 2011

work page 2011
[8]

Grandoni, P

F. Grandoni, P. Krysta, S. Leonardi, and C. Ventre. Utilitarian mechanism design for multiob- jective optimization.SIAM Journal on Computing, 43(4):1263–1290, 2014

work page 2014
[9]

W.-J. Hong, P. Yang, and K. Tang. Evolutionary computation for large-scale multi-objective optimization: A decade of progresses.International Journal of Automation and Computing, 18(2):155–169, 2021

work page 2021
[10]

Hossain, E

S. Hossain, E. Micha, and N. Shah. Fair algorithms for multi-agent multi-armed bandits. Advances in Neural Information Processing Systems, 34:24005–24017, 2021

work page 2021
[11]

Hüyük and C

A. Hüyük and C. Tekin. Multi-objective multi-armed bandit with lexicographically ordered and satisficing objectives.Machine Learning, 110(6):1233–1266, 2021

work page 2021
[12]

M. Inja, C. Kooijman, M. de Waard, D. M. Roijers, and S. Whiteson. Queued pareto local search for multi-objective optimization. InInternational conference on parallel problem solving from nature, pages 589–599. Springer, 2014

work page 2014
[13]

Jones, H

M. Jones, H. Nguyen, and T. Nguyen. An efficient algorithm for fair multi-agent multi-armed bandit with low regret. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8159–8167, 2023

work page 2023
[14]

Landgren, V

P. Landgren, V . Srivastava, and N. E. Leonard. On distributed cooperative decision-making in multiarmed bandits. In2016 European Control Conference (ECC), pages 243–248. IEEE, 2016

work page 2016
[15]

C.-S. Lee. Multi-objective game-theory models for conflict analysis in reservoir watershed management.Chemosphere, 87(6):608–613, 2012

work page 2012
[16]

J. Liu, H. Qiu, L. Yang, and M. Xu. Distributed multi-agent bandits over erd\h {o} sr\’enyi random networks.arXiv preprint arXiv:2510.22811, 2025

work page arXiv 2025
[17]

S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Graepel. Emergent coordination through competition.arXiv preprint arXiv:1902.07151, 2019

work page arXiv 1902
[18]

Madani and J

K. Madani and J. R. Lund. A monte-carlo game theoretic approach for multi-criteria decision making under uncertainty.Advances in water resources, 34(5):607–616, 2011

work page 2011
[19]

Mannion, J

P. Mannion, J. Duggan, and E. Howley. A theoretical and empirical analysis of reward transfor- mations in multi-objective stochastic games. 2017. 10

work page 2017
[20]

Mouaddib, M

A.-I. Mouaddib, M. Boussard, and M. Bouzid. Towards a formal framework for multi-objective multiagent planning. InProceedings of the 6th international joint conference on Autonomous agents and multiagent systems, pages 1–8, 2007

work page 2007
[21]

R˘adulescu, P

R. R˘adulescu, P. Mannion, D. M. Roijers, and A. Nowé. Multi-objective multi-agent decision making: a utility-based analysis and survey.Autonomous Agents and Multi-Agent Systems, 34(1):10, 2020

work page 2020
[22]

G. D. O. Ramos, R. Radulescu, and A. Nowé. A budged-balanced tolling scheme for efficient equilibria under heterogeneous preferences. InProceedings of the Adaptive and Learning Agents Workshop 2019 (ALA-19) at AAMAS, 2019

work page 2019
[23]

Sankararaman, A

A. Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1– 35, 2019

work page 2019
[24]

M. Shi. Communication-corruption coupling and verification in cooperative multi-objective bandits.arXiv preprint arXiv:2601.11924, 2026

work page arXiv 2026
[25]

Taylor, I

A. Taylor, I. Dusparic, E. Galvan-Lopez, S. Clarke, and V . Cahill. Accelerating learning in multi-objective systems through transfer learning. In2014 international joint conference on neural networks (IJCNN), pages 2298–2305. IEEE, 2014

work page 2014
[26]

X. Wang, L. Yang, Y .-Z. J. Chen, X. Liu, M. Hajiesmaili, D. Towsley, and J. C. Lui. Achieving near-optimal individual regret & low communications in multi-agent bandits. InThe Eleventh International Conference on Learning Representations, 2022

work page 2022
[27]

Xu and D

M. Xu and D. Klabjan. Decentralized randomly distributed multi-agent multi-armed bandit with heterogeneous rewards.Advances in Neural Information Processing Systems, 36:74799–74855, 2023

work page 2023
[28]

M. Xu, L. Shan, F. Ghaffari, X. Wang, X. Liu, and M. Hajiesmaili. Heterogeneous multi-agent multi-armed bandit on stochastic block models.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 9(3):1–59, 2025

work page 2025
[29]

S. Q. Yahyaa, M. M. Drugan, and B. Manderick. Annealing-pareto multi-objective multi- armed bandit algorithm. In2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2014

work page 2014
[30]

Zhang, R

M. Zhang, R. Deo-Campo Vuong, and H. Luo. No-regret learning for fair multi-agent social welfare optimization.Advances in Neural Information Processing Systems, 37:57671–57700, 2024

work page 2024
[31]

J. Zhu, R. Sandhu, and J. Liu. A distributed algorithm for sequential decision making in multi-armed bandit with homogeneous rewards. In2020 59th IEEE Conference on Decision and Control (CDC), pages 3078–3083. IEEE, 2020. 11 A Technical appendices and supplementary material A.1 Related Work MA-MABThe study of multi-agent multi-armed bandits (MA-MAB) with ...

work page 2020
[32]

Despite these advances, existing approaches remain largely confined to single-agent settings and focus on Pareto optimality as the primary notion of performance

incorporate alternative preference structures like lexicographic ordering, though they similarly operate within a Pareto-based framework. Despite these advances, existing approaches remain largely confined to single-agent settings and focus on Pareto optimality as the primary notion of performance. A complementary line of work from multi-objective optimiz...

work page
[33]

warm-up period

similarly evaluates performance relative to a fixed scalarization rather than directly addressing multi-objective trade-offs. In contrast to existing MO-MAB literature, which primarily focuses on Pareto-based optimality, we adopt a utility-based perspective for evaluating vector-valued rewards. By modeling agent- specific preferences through preference ve...

work page

[1] [1]

Besbes, Y

O. Besbes, Y . Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards.Advances in neural information processing systems, 27, 2014

work page 2014

[2] [2]

Bone and S

C. Bone and S. Dragi ´cevi´c. Gis and intelligent agents for multiobjective natural resource allocation: A reinforcement learning approach.Transactions in GIS, 13(3):253–272, 2009

work page 2009

[3] [3]

Busa-Fekete, B

R. Busa-Fekete, B. Szörényi, P. Weng, and S. Mannor. Multi-objective bandits: Optimizing the generalized gini index. InInternational Conference on Machine Learning, pages 625–634. PMLR, 2017

work page 2017

[4] [4]

F. M. Delle Fave, R. Stranders, A. Rogers, and N. Jennings. Bounded decentralised coordination over multiple objectives. 2011

work page 2011

[5] [5]

M. M. Drugan and A. Nowe. Designing multi-objective multi-armed bandits algorithms: A study. InThe 2013 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2013

work page 2013

[6] [6]

M. M. Drugan, A. Nowé, and B. Manderick. Pareto upper confidence bounds algorithms: an em- pirical study. In2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2014

work page 2014

[7] [7]

Garivier and E

A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. InInternational conference on algorithmic learning theory, pages 174–188. Springer, 2011

work page 2011

[8] [8]

Grandoni, P

F. Grandoni, P. Krysta, S. Leonardi, and C. Ventre. Utilitarian mechanism design for multiob- jective optimization.SIAM Journal on Computing, 43(4):1263–1290, 2014

work page 2014

[9] [9]

W.-J. Hong, P. Yang, and K. Tang. Evolutionary computation for large-scale multi-objective optimization: A decade of progresses.International Journal of Automation and Computing, 18(2):155–169, 2021

work page 2021

[10] [10]

Hossain, E

S. Hossain, E. Micha, and N. Shah. Fair algorithms for multi-agent multi-armed bandits. Advances in Neural Information Processing Systems, 34:24005–24017, 2021

work page 2021

[11] [11]

Hüyük and C

A. Hüyük and C. Tekin. Multi-objective multi-armed bandit with lexicographically ordered and satisficing objectives.Machine Learning, 110(6):1233–1266, 2021

work page 2021

[12] [12]

M. Inja, C. Kooijman, M. de Waard, D. M. Roijers, and S. Whiteson. Queued pareto local search for multi-objective optimization. InInternational conference on parallel problem solving from nature, pages 589–599. Springer, 2014

work page 2014

[13] [13]

Jones, H

M. Jones, H. Nguyen, and T. Nguyen. An efficient algorithm for fair multi-agent multi-armed bandit with low regret. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8159–8167, 2023

work page 2023

[14] [14]

Landgren, V

P. Landgren, V . Srivastava, and N. E. Leonard. On distributed cooperative decision-making in multiarmed bandits. In2016 European Control Conference (ECC), pages 243–248. IEEE, 2016

work page 2016

[15] [15]

C.-S. Lee. Multi-objective game-theory models for conflict analysis in reservoir watershed management.Chemosphere, 87(6):608–613, 2012

work page 2012

[16] [16]

J. Liu, H. Qiu, L. Yang, and M. Xu. Distributed multi-agent bandits over erd\h {o} sr\’enyi random networks.arXiv preprint arXiv:2510.22811, 2025

work page arXiv 2025

[17] [17]

S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Graepel. Emergent coordination through competition.arXiv preprint arXiv:1902.07151, 2019

work page arXiv 1902

[18] [18]

Madani and J

K. Madani and J. R. Lund. A monte-carlo game theoretic approach for multi-criteria decision making under uncertainty.Advances in water resources, 34(5):607–616, 2011

work page 2011

[19] [19]

Mannion, J

P. Mannion, J. Duggan, and E. Howley. A theoretical and empirical analysis of reward transfor- mations in multi-objective stochastic games. 2017. 10

work page 2017

[20] [20]

Mouaddib, M

A.-I. Mouaddib, M. Boussard, and M. Bouzid. Towards a formal framework for multi-objective multiagent planning. InProceedings of the 6th international joint conference on Autonomous agents and multiagent systems, pages 1–8, 2007

work page 2007

[21] [21]

R˘adulescu, P

R. R˘adulescu, P. Mannion, D. M. Roijers, and A. Nowé. Multi-objective multi-agent decision making: a utility-based analysis and survey.Autonomous Agents and Multi-Agent Systems, 34(1):10, 2020

work page 2020

[22] [22]

G. D. O. Ramos, R. Radulescu, and A. Nowé. A budged-balanced tolling scheme for efficient equilibria under heterogeneous preferences. InProceedings of the Adaptive and Learning Agents Workshop 2019 (ALA-19) at AAMAS, 2019

work page 2019

[23] [23]

Sankararaman, A

A. Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1– 35, 2019

work page 2019

[24] [24]

M. Shi. Communication-corruption coupling and verification in cooperative multi-objective bandits.arXiv preprint arXiv:2601.11924, 2026

work page arXiv 2026

[25] [25]

Taylor, I

A. Taylor, I. Dusparic, E. Galvan-Lopez, S. Clarke, and V . Cahill. Accelerating learning in multi-objective systems through transfer learning. In2014 international joint conference on neural networks (IJCNN), pages 2298–2305. IEEE, 2014

work page 2014

[26] [26]

X. Wang, L. Yang, Y .-Z. J. Chen, X. Liu, M. Hajiesmaili, D. Towsley, and J. C. Lui. Achieving near-optimal individual regret & low communications in multi-agent bandits. InThe Eleventh International Conference on Learning Representations, 2022

work page 2022

[27] [27]

Xu and D

M. Xu and D. Klabjan. Decentralized randomly distributed multi-agent multi-armed bandit with heterogeneous rewards.Advances in Neural Information Processing Systems, 36:74799–74855, 2023

work page 2023

[28] [28]

M. Xu, L. Shan, F. Ghaffari, X. Wang, X. Liu, and M. Hajiesmaili. Heterogeneous multi-agent multi-armed bandit on stochastic block models.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 9(3):1–59, 2025

work page 2025

[29] [29]

S. Q. Yahyaa, M. M. Drugan, and B. Manderick. Annealing-pareto multi-objective multi- armed bandit algorithm. In2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2014

work page 2014

[30] [30]

Zhang, R

M. Zhang, R. Deo-Campo Vuong, and H. Luo. No-regret learning for fair multi-agent social welfare optimization.Advances in Neural Information Processing Systems, 37:57671–57700, 2024

work page 2024

[31] [31]

J. Zhu, R. Sandhu, and J. Liu. A distributed algorithm for sequential decision making in multi-armed bandit with homogeneous rewards. In2020 59th IEEE Conference on Decision and Control (CDC), pages 3078–3083. IEEE, 2020. 11 A Technical appendices and supplementary material A.1 Related Work MA-MABThe study of multi-agent multi-armed bandits (MA-MAB) with ...

work page 2020

[32] [32]

Despite these advances, existing approaches remain largely confined to single-agent settings and focus on Pareto optimality as the primary notion of performance

incorporate alternative preference structures like lexicographic ordering, though they similarly operate within a Pareto-based framework. Despite these advances, existing approaches remain largely confined to single-agent settings and focus on Pareto optimality as the primary notion of performance. A complementary line of work from multi-objective optimiz...

work page

[33] [33]

warm-up period

similarly evaluates performance relative to a fixed scalarization rather than directly addressing multi-objective trade-offs. In contrast to existing MO-MAB literature, which primarily focuses on Pareto-based optimality, we adopt a utility-based perspective for evaluating vector-valued rewards. By modeling agent- specific preferences through preference ve...

work page