pith. sign in

arxiv: 2606.20236 · v1 · pith:6MBR4ZNHnew · submitted 2026-06-18 · 💻 cs.AI · cs.LG· cs.MA

A Multi-Agent system for Multi-Objective constrained optimization

Pith reviewed 2026-06-26 17:17 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.MA
keywords multi-agent reinforcement learningconstrained optimizationreward weightsmulti-objective optimizationdynamic environmentsLagrangian formulation
0
0 comments X

The pith

MAMO treats reward weight selection as a separate learning problem solved by multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how reinforcement learning for cost-minimization problems with performance constraints usually requires manual choice of weights on penalty terms. These weights determine the balance between the main goal and avoiding violations, yet their best values shift in non-stationary settings. MAMO moves the weight choice into its own multi-agent learning process so that task execution and objective balancing become distinct. This separation is presented as an initial move toward RL agents that can handle constrained optimization without repeated human tuning.

Core claim

MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem solved through multi-agent RL, providing a first step toward more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.

What carries the argument

MAMO, the multi-agent RL system that learns reward weights separately from the primary policy.

If this is right

  • Weight selection no longer requires repeated manual intervention when environments change.
  • The primary task policy can focus on execution while a separate process handles objective trade-offs.
  • RL solutions become more adaptable to shifts in the relative importance of costs versus constraints.
  • The approach opens a path to fully automated deployment of constrained optimizers at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar separation of concerns could apply to other hyperparameter choices that currently need manual adjustment in RL.
  • The method might generalize to settings where multiple objectives compete and their priorities evolve over time.
  • If stable, it reduces the expertise barrier for applying RL to real networking or computing systems.

Load-bearing premise

A multi-agent formulation can learn suitable reward weights in a stable and effective manner that outperforms or reliably replaces manual selection without introducing new coordination or convergence issues.

What would settle it

A controlled test in a non-stationary environment where the MAMO-learned weights produce policies with higher constraint violations or worse primary costs than carefully hand-tuned weights on the same problem.

Figures

Figures reproduced from arXiv: 2606.20236 by Federica Filippini.

Figure 1
Figure 1. Figure 1: The replica scaling problem in edge-FaaS systems [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MAMO architecture time scales and abstraction levels, treating the selection of the reward-weighting coefficients as a learning problem rather than as a fixed design choice. To this end, the framework is composed of two agents with complementary roles, as shown in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: 𝑛𝑓 and 𝑝(𝜆𝑓 , 𝑛𝑓 ) observed solving Problem (P1) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: 𝑛𝑓 and 𝑝(𝜆𝑓 , 𝑛𝑓 ) after training TE with 𝑤 = 0.99 (left) and 𝑤 = 0.1 (right); evaluation lasts 600 steps [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: 𝑝(𝜆𝑓 , 𝑛𝑓 ) while the WA agent training proceeds (above) and detailed solution (𝑛𝑓 and 𝑝(𝜆𝑓 , 𝑛𝑓 )) in a repre￾sentative scenario with 𝑤 = 0.85 (middle and below) probability of rejections 𝑝 and the average costs over the last 300 steps, and receives a positive reward if 𝑝 < 0.05. Both agents are trained using RL4CC5 , an open-source library that extends Ray RLlib6 . They are implemented using Deep Q￾Learn… view at source ↗
read the original abstract

Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a !rst step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MAMO, a multi-agent RL system that formulates the selection of reward weights (in a Lagrangian-style penalty for constrained cost-minimization problems) as a separate learning task. This is intended to decouple policy execution from manual objective design, yielding more autonomous RL solutions for non-stationary constrained optimization in computing and networking systems.

Significance. If a stable multi-agent weight learner could be shown to outperform manual tuning without introducing coordination failures or convergence problems, the approach would address a practical pain point in applying RL to constrained problems. However, the manuscript supplies no agent architecture, reward definitions, training procedure, or empirical results, so the significance cannot be assessed beyond the conceptual level.

major comments (2)
  1. The central claim (decoupling via multi-agent weight learning) is load-bearing yet unsupported: the abstract and manuscript contain no description of the per-agent state/action spaces, the individual or joint reward signals, the mechanism that maps learned weights back to the original constrained RL task, or any stability argument. Without these elements the weakest assumption (stable, effective learning without new coordination issues) cannot be evaluated.
  2. No experimental section, baseline comparisons, or ablation studies are present to test whether the multi-agent formulation improves autonomy or robustness over manual weight selection in non-stationary environments.
minor comments (2)
  1. Abstract contains a typographical error: "!rst" should be "first".
  2. Notation for the underlying constrained RL problem (costs, constraints, Lagrangian weights) is never formalized, making it difficult to understand how the multi-agent component interfaces with the original task.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our conceptual proposal for MAMO. We agree that the manuscript is currently at a high-level conceptual stage and will expand it substantially in revision to address the points raised.

read point-by-point responses
  1. Referee: The central claim (decoupling via multi-agent weight learning) is load-bearing yet unsupported: the abstract and manuscript contain no description of the per-agent state/action spaces, the individual or joint reward signals, the mechanism that maps learned weights back to the original constrained RL task, or any stability argument. Without these elements the weakest assumption (stable, effective learning without new coordination issues) cannot be evaluated.

    Authors: We acknowledge that the manuscript presents the MAMO idea at a conceptual level and does not yet specify per-agent state/action spaces, reward signals, the weight-mapping mechanism to the constrained task, or stability arguments. This was a deliberate choice to focus on the high-level decoupling insight for non-stationary constrained optimization. In the revised manuscript we will add a dedicated section detailing the multi-agent formulation, including state and action spaces for the weight-selection agents, individual and joint reward definitions, the mechanism for applying learned weights to the Lagrangian penalty, and a discussion of coordination and stability considerations. revision: yes

  2. Referee: No experimental section, baseline comparisons, or ablation studies are present to test whether the multi-agent formulation improves autonomy or robustness over manual weight selection in non-stationary environments.

    Authors: The current manuscript is a short conceptual paper introducing the MAMO framework without empirical results. We agree that validation against manual tuning and other approaches is essential to assess practical benefits in non-stationary settings. In the revision we will include an experimental section with case studies from computing/networking domains, baseline comparisons (including manual weight selection and single-agent alternatives), and ablations on the multi-agent components to evaluate gains in autonomy and robustness. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; abstract-only text supplies no load-bearing steps

full rationale

The supplied document contains only the abstract, which states the MAMO idea at a conceptual level without equations, fitted parameters, self-citations, or any derivation that could reduce to its inputs. No self-definitional, fitted-input, or uniqueness claims appear. The central premise (formulating weight selection as a learning problem) is asserted but not constructed via any chain that could be inspected for circularity. Per the rules, when no formal derivation exists the finding is no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5693 in / 913 out tokens · 20005 ms · 2026-06-26T17:17:21.873718+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 21 canonical work pages

  1. [1]

    Axel Abels, Diederik Roijers, Tom Lenaerts, Ann Nowé, and Denis Steckelmacher

  2. [2]

    In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

    Dynamic Weights in Multi-Objective Deep Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhut- dinov (Eds.). PMLR, 11–20. https://proceedings.mlr.press/v97/abels19a.html

  3. [3]

    Ardagna, G

    D. Ardagna, G. Casale, M. Ciavotta, J. F. Pérez, and W. Wang. 2014. Quality- of-service in cloud computing: modeling techniques and their applications.J Internet Serv Appl5 (2014), 17. https://doi.org/10.1186/s13174-014-0011-3

  4. [4]

    M.A. Bragin. 2024. Survey on Lagrangian relaxation for MILP: importance, challenges, historical review, recent advancements, and opportunities.Ann Oper Res333 (2024), 29–45. https://doi.org/10.1007/s10479-023-05499-9

  5. [5]

    Valeria Cardellini, Francesco Lo Presti, Matteo Nardelli, and Gabriele Russo Russo

  6. [6]

    Kent Wenger

    Decentralized self-adaptation for elastic Data Stream Processing.Future Generation Computer Systems87 (2018), 171–185. https://doi.org/10.1016/j.future. 2018.05.025

  7. [7]

    Riccardo Cavadini, Hamta Sedghani, Federica Filippini, and Danilo Ardagna

  8. [8]

    In: European Conference on Com- puter Vision

    Runtime Management of Artificial Intelligence Applications Through Hierarchical Reinforcement Learning. InPerformance Evaluation Methodologies and Tools, Marco Gribaudo, Mauro Iacono, and Sahra Sedigh Sarvestani (Eds.). Springer Nature Switzerland, Cham, 252–273. https://doi.org/10.1007/978-3- 032-06818-7_14

  9. [9]

    Xi Chen, Ali Ghadirzadeh, Mårten Björkman, and Patric Jensfelt. 2019. Meta- Learning for Multi-objective Reinforcement Learning. In2019 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS). 977–983. https: //doi.org/10.1109/IROS40897.2019.8968092

  10. [10]

    Schahram Dustdar, Victor Casamayor Pujol, and Praveen Kumar Donta. 2023. On Distributed Computing Continuum Systems.IEEE Transactions on Knowledge and Data Engineering35, 4 (2023), 4092–4105. https://doi.org/10.1109/TKDE. 2022.3142856

  11. [11]

    Federica Filippini, Jonatha Anselmi, Danilo Ardagna, and Bruno Gaujal. 2024. A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems. IEEE Transactions on Cloud Computing12, 01 (01 2024), 53–69. https://doi.org/ 10.1109/TCC.2023.3336540

  12. [12]

    Federica Filippini, Marco Lattuada, Michele Ciavotta, Arezoo Jahani, Danilo Ardagna, and Edoardo Amaldi. 2023. A Path Relinking Method for the Joint Online Scheduling and Capacity Allocation of DL Training Workloads in GPU as a Service Systems.IEEE Transactions on Services Computing16, 3 (2023), 1630–1646. https://doi.org/10.1109/TSC.2022.3188440

  13. [13]

    Federica Filippini, Hamta Sedghani, and Danilo Ardagna. 2023. SPACE4AI- R: Runtime Management Tool for AI Applications Component Placement and Resource Selection in Computing Continua. InProceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing(Taormina (Messina), Italy,)(UCC ’23). Association for Computing Machinery, New Yo...

  14. [14]

    Jordan, Philip S

    Dhawal Gupta, Yash Chandak, Scott M. Jordan, Philip S. Thomas, and Bruno Cas- tro da Silva. 2023. Behavior alignment via reward function optimization. In Proceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2297, 33 pages

  15. [15]

    A practical guide to multi-objective reinforcement learning and planning,

    Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. 2022. A practical guide to multi-objec...

  16. [16]

    Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, and Changjie Fan. 2020. Learning to utilize shaping rewards: a new approach of reward shaping. InProceedings of the 34th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, Artic...

  17. [17]

    Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C

    Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Mor- cos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel

  18. [18]

    https: //doi.org/10.1126/science.aau6249

    Human-level performance in 3D multiplayer games with population- based reinforcement learning.Science364, 6443 (2019), 859–865. https: //doi.org/10.1126/science.aau6249

  19. [19]

    Abednego Wamuhindo Kambale, Hamta Sedghani, Federica Filippini, Giacomo Verticale, and Danilo Ardagna. 2026. Tabular Reinforcement Learning Methods for Artificial Intelligence Tasks Offloading in Smart Eye-Wears.ACM Trans. Auton. Adapt. Syst.21, 1, Article 3 (March 2026), 38 pages. https://doi.org/10. 1145/3771092

  20. [20]

    X. Li, L. Pan, and S. Liu. 2023. A DRL-based online VM scheduler for cost optimization in cloud brokers. InWorld Wide Web 26. 2399–2425. https://doi. org/10.1007/s11280-023-01145-3

  21. [21]

    Lewis, and Shiyin Qin

    Bingyao Liu, Satinder Singh, Richard L. Lewis, and Shiyin Qin. 2014. Optimal Rewards for Cooperative Agents.IEEE Transactions on Autonomous Mental Development6, 4 (2014), 286–297. https://doi.org/10.1109/TAMD.2014.2362682

  22. [22]

    Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, and Meng Jiang. 2025. Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting. arXiv:2509.11452 [cs.LG] https://arxiv.org/abs/2509.11452

  23. [23]

    Nima Mahmoudi and Hamzeh Khazaei. 2020. Performance Modeling of Serverless Computing Platforms.IEEE Transactions on Cloud Computing10, 4 (2020), 2834–

  24. [24]

    https://doi.org/10.1109/TCC.2020.3033373

  25. [25]

    Octavio Pappalardo, Rodrigo Ramele, and Juan Miguel Santos. 2024. Black box meta-learning intrinsic rewards for sparse-reward environments. arXiv:2407.21546 [cs.LG] https://arxiv.org/abs/2407.21546

  26. [26]

    Emanuele Petriglia, Federica Filippini, Michele Ciavotta, and Marco Savi. 2025. Multi-Agent Reinforcement Learning for Workload Distribution in FaaS-Edge Computing Systems. In2025 IEEE International Parallel and Distributed Pro- cessing Symposium Workshops (IPDPSW). 1128–1131. https://doi.org/10.1109/ IPDPSW66978.2025.00176

  27. [27]

    Hamta Sedghani, Federica Filippini, and Danilo Ardagna. 2024. SPACE4AI- D: A Design-Time Tool for AI Applications Resource Selection in Computing Continua.IEEE Transactions on Services Computing17, 6 (2024), 4324–4339. https://doi.org/10.1109/TSC.2024.3479935

  28. [28]

    Hassam Ullah Sheikh, Shauharda Khadka, Santiago Miret, Somdeb Majumdar, and Mariano Phielipp. 2022. Learning Intrinsic Symbolic Rewards in Reinforcement Learning. In2022 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN55064.2022.9892256

  29. [29]

    Li Shi, Zhemin Zhang, and Thomas Robertazzi. 2017. Energy-Aware Scheduling of Embarrassingly Parallel Jobs and Resource Allocation in Cloud.IEEE Transactions on Parallel and Distributed Systems28, 6 (2017), 1607–1620. https://doi.org/10. 1109/TPDS.2016.2625254

  30. [30]

    Lewis, Andrew G

    Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. 2010. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transactions on Autonomous Mental Development2, 2 (2010), 70–82. https: //doi.org/10.1109/TAMD.2010.2051031

  31. [31]

    Jonathan Sorg, Satinder Singh, and Richard L. Lewis. 2010. Reward design via online Gradient ascent. InProceedings of the 24th International Conference on Neural Information Processing Systems - Volume 2(Vancouver, British Columbia, Canada)(NIPS’10). Curran Associates Inc., Red Hook, NY, USA, 2190–2198

  32. [32]

    Atefeh Talebian, Alvin Valera, Jyoti Sahni, and Winston K. G. Seah. 2022. Ro- bust Intra-Slice Migration in Fog Computing. In2022 IEEE 47th Conference on Local Computer Networks (LCN). 48–55. https://doi.org/10.1109/LCN53696.2022. 9843470

  33. [33]

    Alessandro Tundo, Federica Filippini, Francesco Regonesi, Michele Ciavotta, and Marco Savi. 2025. Decentralized Edge Workload Forecasting With Gossip Learning.IEEE Transactions on Network and Service Management22, 4 (2025), 3016–3031. https://doi.org/10.1109/TNSM.2025.3570450

  34. [34]

    Pengwei Wang, Yi Li, Chao Fang, Yichen Zhong, and Zhijun Ding. 2025. Opti- mizing Serverless Performance Through Game Theory and Efficient Resource Scheduling.IEEE Trans. Comput.74, 6 (2025), 1990–2002. https://doi.org/10.1109/ TC.2025.3547158

  35. [35]

    Qian Wang, Rajan Batta, Joyendu Bhadury, and Christopher M. Rump. 2003. Budget constrained location problem with opening and closing of facilities.Com- put. Oper. Res.30, 13 (Nov. 2003), 2047–2069. https://doi.org/10.1016/S0305- 0548(02)00123-5

  36. [36]

    Zhongwen Xu, Hado van Hasselt, and David Silver. 2018. Meta-gradient rein- forcement learning. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 2402–2413

  37. [37]

    Chen Yang, Guangkai Yang, and Junge Zhang. 2023. Learning Individual Differ- ence Rewards in Multi-Agent Reinforcement Learning. InProceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems(London, United Kingdom)(AAMAS ’23). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2418–2420

  38. [38]

    Zanussi, D

    L. Zanussi, D. Tessera, L. Massari, and M.C. Calzarossa. 2024. Workflow Sched- uling in the Cloud-Edge Continuum. InLecture Notes on Data Engineering and Communications Technologies (AINA 2024, Vol. 203), Cham Springer (Ed.). 182–190. https://doi.org/10.1007/978-3-031-57931-8_18

  39. [39]

    Zeyu Zheng, Junhyuk Oh, and Satinder Singh. 2018. On learning intrinsic rewards for policy gradient methods. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 4649–4659