A Multi-Agent system for Multi-Objective constrained optimization
Pith reviewed 2026-06-26 17:17 UTC · model grok-4.3
The pith
MAMO treats reward weight selection as a separate learning problem solved by multi-agent reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem solved through multi-agent RL, providing a first step toward more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
What carries the argument
MAMO, the multi-agent RL system that learns reward weights separately from the primary policy.
If this is right
- Weight selection no longer requires repeated manual intervention when environments change.
- The primary task policy can focus on execution while a separate process handles objective trade-offs.
- RL solutions become more adaptable to shifts in the relative importance of costs versus constraints.
- The approach opens a path to fully automated deployment of constrained optimizers at runtime.
Where Pith is reading between the lines
- Similar separation of concerns could apply to other hyperparameter choices that currently need manual adjustment in RL.
- The method might generalize to settings where multiple objectives compete and their priorities evolve over time.
- If stable, it reduces the expertise barrier for applying RL to real networking or computing systems.
Load-bearing premise
A multi-agent formulation can learn suitable reward weights in a stable and effective manner that outperforms or reliably replaces manual selection without introducing new coordination or convergence issues.
What would settle it
A controlled test in a non-stationary environment where the MAMO-learned weights produce policies with higher constraint violations or worse primary costs than carefully hand-tuned weights on the same problem.
Figures
read the original abstract
Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a !rst step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MAMO, a multi-agent RL system that formulates the selection of reward weights (in a Lagrangian-style penalty for constrained cost-minimization problems) as a separate learning task. This is intended to decouple policy execution from manual objective design, yielding more autonomous RL solutions for non-stationary constrained optimization in computing and networking systems.
Significance. If a stable multi-agent weight learner could be shown to outperform manual tuning without introducing coordination failures or convergence problems, the approach would address a practical pain point in applying RL to constrained problems. However, the manuscript supplies no agent architecture, reward definitions, training procedure, or empirical results, so the significance cannot be assessed beyond the conceptual level.
major comments (2)
- The central claim (decoupling via multi-agent weight learning) is load-bearing yet unsupported: the abstract and manuscript contain no description of the per-agent state/action spaces, the individual or joint reward signals, the mechanism that maps learned weights back to the original constrained RL task, or any stability argument. Without these elements the weakest assumption (stable, effective learning without new coordination issues) cannot be evaluated.
- No experimental section, baseline comparisons, or ablation studies are present to test whether the multi-agent formulation improves autonomy or robustness over manual weight selection in non-stationary environments.
minor comments (2)
- Abstract contains a typographical error: "!rst" should be "first".
- Notation for the underlying constrained RL problem (costs, constraints, Lagrangian weights) is never formalized, making it difficult to understand how the multi-agent component interfaces with the original task.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our conceptual proposal for MAMO. We agree that the manuscript is currently at a high-level conceptual stage and will expand it substantially in revision to address the points raised.
read point-by-point responses
-
Referee: The central claim (decoupling via multi-agent weight learning) is load-bearing yet unsupported: the abstract and manuscript contain no description of the per-agent state/action spaces, the individual or joint reward signals, the mechanism that maps learned weights back to the original constrained RL task, or any stability argument. Without these elements the weakest assumption (stable, effective learning without new coordination issues) cannot be evaluated.
Authors: We acknowledge that the manuscript presents the MAMO idea at a conceptual level and does not yet specify per-agent state/action spaces, reward signals, the weight-mapping mechanism to the constrained task, or stability arguments. This was a deliberate choice to focus on the high-level decoupling insight for non-stationary constrained optimization. In the revised manuscript we will add a dedicated section detailing the multi-agent formulation, including state and action spaces for the weight-selection agents, individual and joint reward definitions, the mechanism for applying learned weights to the Lagrangian penalty, and a discussion of coordination and stability considerations. revision: yes
-
Referee: No experimental section, baseline comparisons, or ablation studies are present to test whether the multi-agent formulation improves autonomy or robustness over manual weight selection in non-stationary environments.
Authors: The current manuscript is a short conceptual paper introducing the MAMO framework without empirical results. We agree that validation against manual tuning and other approaches is essential to assess practical benefits in non-stationary settings. In the revision we will include an experimental section with case studies from computing/networking domains, baseline comparisons (including manual weight selection and single-agent alternatives), and ablations on the multi-agent components to evaluate gains in autonomy and robustness. revision: yes
Circularity Check
No derivation chain or equations present; abstract-only text supplies no load-bearing steps
full rationale
The supplied document contains only the abstract, which states the MAMO idea at a conceptual level without equations, fitted parameters, self-citations, or any derivation that could reduce to its inputs. No self-definitional, fitted-input, or uniqueness claims appear. The central premise (formulating weight selection as a learning problem) is asserted but not constructed via any chain that could be inspected for circularity. Per the rules, when no formal derivation exists the finding is no significant circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Axel Abels, Diederik Roijers, Tom Lenaerts, Ann Nowé, and Denis Steckelmacher
-
[2]
In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol
Dynamic Weights in Multi-Objective Deep Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhut- dinov (Eds.). PMLR, 11–20. https://proceedings.mlr.press/v97/abels19a.html
-
[3]
D. Ardagna, G. Casale, M. Ciavotta, J. F. Pérez, and W. Wang. 2014. Quality- of-service in cloud computing: modeling techniques and their applications.J Internet Serv Appl5 (2014), 17. https://doi.org/10.1186/s13174-014-0011-3
-
[4]
M.A. Bragin. 2024. Survey on Lagrangian relaxation for MILP: importance, challenges, historical review, recent advancements, and opportunities.Ann Oper Res333 (2024), 29–45. https://doi.org/10.1007/s10479-023-05499-9
-
[5]
Valeria Cardellini, Francesco Lo Presti, Matteo Nardelli, and Gabriele Russo Russo
-
[6]
Decentralized self-adaptation for elastic Data Stream Processing.Future Generation Computer Systems87 (2018), 171–185. https://doi.org/10.1016/j.future. 2018.05.025
-
[7]
Riccardo Cavadini, Hamta Sedghani, Federica Filippini, and Danilo Ardagna
-
[8]
In: European Conference on Com- puter Vision
Runtime Management of Artificial Intelligence Applications Through Hierarchical Reinforcement Learning. InPerformance Evaluation Methodologies and Tools, Marco Gribaudo, Mauro Iacono, and Sahra Sedigh Sarvestani (Eds.). Springer Nature Switzerland, Cham, 252–273. https://doi.org/10.1007/978-3- 032-06818-7_14
-
[9]
Xi Chen, Ali Ghadirzadeh, Mårten Björkman, and Patric Jensfelt. 2019. Meta- Learning for Multi-objective Reinforcement Learning. In2019 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS). 977–983. https: //doi.org/10.1109/IROS40897.2019.8968092
-
[10]
Schahram Dustdar, Victor Casamayor Pujol, and Praveen Kumar Donta. 2023. On Distributed Computing Continuum Systems.IEEE Transactions on Knowledge and Data Engineering35, 4 (2023), 4092–4105. https://doi.org/10.1109/TKDE. 2022.3142856
-
[11]
Federica Filippini, Jonatha Anselmi, Danilo Ardagna, and Bruno Gaujal. 2024. A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems. IEEE Transactions on Cloud Computing12, 01 (01 2024), 53–69. https://doi.org/ 10.1109/TCC.2023.3336540
-
[12]
Federica Filippini, Marco Lattuada, Michele Ciavotta, Arezoo Jahani, Danilo Ardagna, and Edoardo Amaldi. 2023. A Path Relinking Method for the Joint Online Scheduling and Capacity Allocation of DL Training Workloads in GPU as a Service Systems.IEEE Transactions on Services Computing16, 3 (2023), 1630–1646. https://doi.org/10.1109/TSC.2022.3188440
-
[13]
Federica Filippini, Hamta Sedghani, and Danilo Ardagna. 2023. SPACE4AI- R: Runtime Management Tool for AI Applications Component Placement and Resource Selection in Computing Continua. InProceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing(Taormina (Messina), Italy,)(UCC ’23). Association for Computing Machinery, New Yo...
-
[14]
Jordan, Philip S
Dhawal Gupta, Yash Chandak, Scott M. Jordan, Philip S. Thomas, and Bruno Cas- tro da Silva. 2023. Behavior alignment via reward function optimization. In Proceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2297, 33 pages
2023
-
[15]
A practical guide to multi-objective reinforcement learning and planning,
Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. 2022. A practical guide to multi-objec...
-
[16]
Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, and Changjie Fan. 2020. Learning to utilize shaping rewards: a new approach of reward shaping. InProceedings of the 34th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, Artic...
2020
-
[17]
Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C
Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Mor- cos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel
-
[18]
https: //doi.org/10.1126/science.aau6249
Human-level performance in 3D multiplayer games with population- based reinforcement learning.Science364, 6443 (2019), 859–865. https: //doi.org/10.1126/science.aau6249
-
[19]
Abednego Wamuhindo Kambale, Hamta Sedghani, Federica Filippini, Giacomo Verticale, and Danilo Ardagna. 2026. Tabular Reinforcement Learning Methods for Artificial Intelligence Tasks Offloading in Smart Eye-Wears.ACM Trans. Auton. Adapt. Syst.21, 1, Article 3 (March 2026), 38 pages. https://doi.org/10. 1145/3771092
2026
-
[20]
X. Li, L. Pan, and S. Liu. 2023. A DRL-based online VM scheduler for cost optimization in cloud brokers. InWorld Wide Web 26. 2399–2425. https://doi. org/10.1007/s11280-023-01145-3
-
[21]
Bingyao Liu, Satinder Singh, Richard L. Lewis, and Shiyin Qin. 2014. Optimal Rewards for Cooperative Agents.IEEE Transactions on Autonomous Mental Development6, 4 (2014), 286–297. https://doi.org/10.1109/TAMD.2014.2362682
-
[22]
Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, and Meng Jiang. 2025. Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting. arXiv:2509.11452 [cs.LG] https://arxiv.org/abs/2509.11452
arXiv 2025
-
[23]
Nima Mahmoudi and Hamzeh Khazaei. 2020. Performance Modeling of Serverless Computing Platforms.IEEE Transactions on Cloud Computing10, 4 (2020), 2834–
2020
-
[24]
https://doi.org/10.1109/TCC.2020.3033373
-
[25]
Octavio Pappalardo, Rodrigo Ramele, and Juan Miguel Santos. 2024. Black box meta-learning intrinsic rewards for sparse-reward environments. arXiv:2407.21546 [cs.LG] https://arxiv.org/abs/2407.21546
arXiv 2024
-
[26]
Emanuele Petriglia, Federica Filippini, Michele Ciavotta, and Marco Savi. 2025. Multi-Agent Reinforcement Learning for Workload Distribution in FaaS-Edge Computing Systems. In2025 IEEE International Parallel and Distributed Pro- cessing Symposium Workshops (IPDPSW). 1128–1131. https://doi.org/10.1109/ IPDPSW66978.2025.00176
arXiv 2025
-
[27]
Hamta Sedghani, Federica Filippini, and Danilo Ardagna. 2024. SPACE4AI- D: A Design-Time Tool for AI Applications Resource Selection in Computing Continua.IEEE Transactions on Services Computing17, 6 (2024), 4324–4339. https://doi.org/10.1109/TSC.2024.3479935
-
[28]
Hassam Ullah Sheikh, Shauharda Khadka, Santiago Miret, Somdeb Majumdar, and Mariano Phielipp. 2022. Learning Intrinsic Symbolic Rewards in Reinforcement Learning. In2022 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN55064.2022.9892256
-
[29]
Li Shi, Zhemin Zhang, and Thomas Robertazzi. 2017. Energy-Aware Scheduling of Embarrassingly Parallel Jobs and Resource Allocation in Cloud.IEEE Transactions on Parallel and Distributed Systems28, 6 (2017), 1607–1620. https://doi.org/10. 1109/TPDS.2016.2625254
arXiv 2017
-
[30]
Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. 2010. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transactions on Autonomous Mental Development2, 2 (2010), 70–82. https: //doi.org/10.1109/TAMD.2010.2051031
-
[31]
Jonathan Sorg, Satinder Singh, and Richard L. Lewis. 2010. Reward design via online Gradient ascent. InProceedings of the 24th International Conference on Neural Information Processing Systems - Volume 2(Vancouver, British Columbia, Canada)(NIPS’10). Curran Associates Inc., Red Hook, NY, USA, 2190–2198
2010
-
[32]
Atefeh Talebian, Alvin Valera, Jyoti Sahni, and Winston K. G. Seah. 2022. Ro- bust Intra-Slice Migration in Fog Computing. In2022 IEEE 47th Conference on Local Computer Networks (LCN). 48–55. https://doi.org/10.1109/LCN53696.2022. 9843470
-
[33]
Alessandro Tundo, Federica Filippini, Francesco Regonesi, Michele Ciavotta, and Marco Savi. 2025. Decentralized Edge Workload Forecasting With Gossip Learning.IEEE Transactions on Network and Service Management22, 4 (2025), 3016–3031. https://doi.org/10.1109/TNSM.2025.3570450
-
[34]
Pengwei Wang, Yi Li, Chao Fang, Yichen Zhong, and Zhijun Ding. 2025. Opti- mizing Serverless Performance Through Game Theory and Efficient Resource Scheduling.IEEE Trans. Comput.74, 6 (2025), 1990–2002. https://doi.org/10.1109/ TC.2025.3547158
arXiv 2025
-
[35]
Qian Wang, Rajan Batta, Joyendu Bhadury, and Christopher M. Rump. 2003. Budget constrained location problem with opening and closing of facilities.Com- put. Oper. Res.30, 13 (Nov. 2003), 2047–2069. https://doi.org/10.1016/S0305- 0548(02)00123-5
-
[36]
Zhongwen Xu, Hado van Hasselt, and David Silver. 2018. Meta-gradient rein- forcement learning. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 2402–2413
2018
-
[37]
Chen Yang, Guangkai Yang, and Junge Zhang. 2023. Learning Individual Differ- ence Rewards in Multi-Agent Reinforcement Learning. InProceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems(London, United Kingdom)(AAMAS ’23). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2418–2420
2023
-
[38]
L. Zanussi, D. Tessera, L. Massari, and M.C. Calzarossa. 2024. Workflow Sched- uling in the Cloud-Edge Continuum. InLecture Notes on Data Engineering and Communications Technologies (AINA 2024, Vol. 203), Cham Springer (Ed.). 182–190. https://doi.org/10.1007/978-3-031-57931-8_18
-
[39]
Zeyu Zheng, Junhyuk Oh, and Satinder Singh. 2018. On learning intrinsic rewards for policy gradient methods. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 4649–4659
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.