A Multi-Agent system for Multi-Objective constrained optimization

Federica Filippini

arxiv: 2606.20236 · v1 · pith:6MBR4ZNHnew · submitted 2026-06-18 · 💻 cs.AI · cs.LG· cs.MA

A Multi-Agent system for Multi-Objective constrained optimization

Federica Filippini This is my paper

Pith reviewed 2026-06-26 17:17 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.MA

keywords multi-agent reinforcement learningconstrained optimizationreward weightsmulti-objective optimizationdynamic environmentsLagrangian formulation

0 comments

The pith

MAMO treats reward weight selection as a separate learning problem solved by multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how reinforcement learning for cost-minimization problems with performance constraints usually requires manual choice of weights on penalty terms. These weights determine the balance between the main goal and avoiding violations, yet their best values shift in non-stationary settings. MAMO moves the weight choice into its own multi-agent learning process so that task execution and objective balancing become distinct. This separation is presented as an initial move toward RL agents that can handle constrained optimization without repeated human tuning.

Core claim

MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem solved through multi-agent RL, providing a first step toward more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.

What carries the argument

MAMO, the multi-agent RL system that learns reward weights separately from the primary policy.

If this is right

Weight selection no longer requires repeated manual intervention when environments change.
The primary task policy can focus on execution while a separate process handles objective trade-offs.
RL solutions become more adaptable to shifts in the relative importance of costs versus constraints.
The approach opens a path to fully automated deployment of constrained optimizers at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar separation of concerns could apply to other hyperparameter choices that currently need manual adjustment in RL.
The method might generalize to settings where multiple objectives compete and their priorities evolve over time.
If stable, it reduces the expertise barrier for applying RL to real networking or computing systems.

Load-bearing premise

A multi-agent formulation can learn suitable reward weights in a stable and effective manner that outperforms or reliably replaces manual selection without introducing new coordination or convergence issues.

What would settle it

A controlled test in a non-stationary environment where the MAMO-learned weights produce policies with higher constraint violations or worse primary costs than carefully hand-tuned weights on the same problem.

Figures

Figures reproduced from arXiv: 2606.20236 by Federica Filippini.

**Figure 2.** Figure 2: MAMO architecture time scales and abstraction levels, treating the selection of the reward-weighting coefficients as a learning problem rather than as a fixed design choice. To this end, the framework is composed of two agents with complementary roles, as shown in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: 𝑛𝑓 and 𝑝(𝜆𝑓 , 𝑛𝑓 ) observed solving Problem (P1) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: 𝑛𝑓 and 𝑝(𝜆𝑓 , 𝑛𝑓 ) after training TE with 𝑤 = 0.99 (left) and 𝑤 = 0.1 (right); evaluation lasts 600 steps [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: 𝑝(𝜆𝑓 , 𝑛𝑓 ) while the WA agent training proceeds (above) and detailed solution (𝑛𝑓 and 𝑝(𝜆𝑓 , 𝑛𝑓 )) in a representative scenario with 𝑤 = 0.85 (middle and below) probability of rejections 𝑝 and the average costs over the last 300 steps, and receives a positive reward if 𝑝 < 0.05. Both agents are trained using RL4CC5 , an open-source library that extends Ray RLlib6 . They are implemented using Deep QLearn… view at source ↗

read the original abstract

Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a !rst step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper floats the idea of a multi-agent RL system to learn reward weights for constrained optimization, but the abstract supplies no method, equations, or results.

read the letter

The central claim is that casting weight selection as its own multi-agent learning problem can decouple the main task policy from manual tuning of the objective, which is useful in non-stationary settings where the cost-constraint trade-off shifts. That framing correctly identifies a practical headache in Lagrangian-style constrained RL for networking and computing problems.

What the work does is restate the standard weighted-penalty approach and then name a multi-agent wrapper around the weight choice. The abstract does not claim a new theoretical result or a new algorithm family; it positions MAMO as an incremental step toward more autonomous RL pipelines.

The obvious limitation is that nothing is shown. There is no description of the individual agents, their local rewards, the joint training loop, how the learned weights are injected back into the original constrained learner, or any stability argument. Without those pieces it is impossible to judge whether the multi-agent layer actually solves the coordination or convergence issues it is meant to address or simply adds new ones. The soundness score in the reader's report is therefore accurate.

This is the sort of early sketch that might interest a small group already working on adaptive multi-objective RL, but it does not yet contain enough substance for a reading group or for referee time. The authors would need to supply at least the concrete architecture and some controlled experiments before the idea can be evaluated.

I would not send it to peer review in its present form.

Referee Report

2 major / 2 minor

Summary. The paper proposes MAMO, a multi-agent RL system that formulates the selection of reward weights (in a Lagrangian-style penalty for constrained cost-minimization problems) as a separate learning task. This is intended to decouple policy execution from manual objective design, yielding more autonomous RL solutions for non-stationary constrained optimization in computing and networking systems.

Significance. If a stable multi-agent weight learner could be shown to outperform manual tuning without introducing coordination failures or convergence problems, the approach would address a practical pain point in applying RL to constrained problems. However, the manuscript supplies no agent architecture, reward definitions, training procedure, or empirical results, so the significance cannot be assessed beyond the conceptual level.

major comments (2)

The central claim (decoupling via multi-agent weight learning) is load-bearing yet unsupported: the abstract and manuscript contain no description of the per-agent state/action spaces, the individual or joint reward signals, the mechanism that maps learned weights back to the original constrained RL task, or any stability argument. Without these elements the weakest assumption (stable, effective learning without new coordination issues) cannot be evaluated.
No experimental section, baseline comparisons, or ablation studies are present to test whether the multi-agent formulation improves autonomy or robustness over manual weight selection in non-stationary environments.

minor comments (2)

Abstract contains a typographical error: "!rst" should be "first".
Notation for the underlying constrained RL problem (costs, constraints, Lagrangian weights) is never formalized, making it difficult to understand how the multi-agent component interfaces with the original task.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our conceptual proposal for MAMO. We agree that the manuscript is currently at a high-level conceptual stage and will expand it substantially in revision to address the points raised.

read point-by-point responses

Referee: The central claim (decoupling via multi-agent weight learning) is load-bearing yet unsupported: the abstract and manuscript contain no description of the per-agent state/action spaces, the individual or joint reward signals, the mechanism that maps learned weights back to the original constrained RL task, or any stability argument. Without these elements the weakest assumption (stable, effective learning without new coordination issues) cannot be evaluated.

Authors: We acknowledge that the manuscript presents the MAMO idea at a conceptual level and does not yet specify per-agent state/action spaces, reward signals, the weight-mapping mechanism to the constrained task, or stability arguments. This was a deliberate choice to focus on the high-level decoupling insight for non-stationary constrained optimization. In the revised manuscript we will add a dedicated section detailing the multi-agent formulation, including state and action spaces for the weight-selection agents, individual and joint reward definitions, the mechanism for applying learned weights to the Lagrangian penalty, and a discussion of coordination and stability considerations. revision: yes
Referee: No experimental section, baseline comparisons, or ablation studies are present to test whether the multi-agent formulation improves autonomy or robustness over manual weight selection in non-stationary environments.

Authors: The current manuscript is a short conceptual paper introducing the MAMO framework without empirical results. We agree that validation against manual tuning and other approaches is essential to assess practical benefits in non-stationary settings. In the revision we will include an experimental section with case studies from computing/networking domains, baseline comparisons (including manual weight selection and single-agent alternatives), and ablations on the multi-agent components to evaluate gains in autonomy and robustness. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; abstract-only text supplies no load-bearing steps

full rationale

The supplied document contains only the abstract, which states the MAMO idea at a conceptual level without equations, fitted parameters, self-citations, or any derivation that could reduce to its inputs. No self-definitional, fitted-input, or uniqueness claims appear. The central premise (formulating weight selection as a learning problem) is asserted but not constructed via any chain that could be inspected for circularity. Per the rules, when no formal derivation exists the finding is no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5693 in / 913 out tokens · 20005 ms · 2026-06-26T17:17:21.873718+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 21 canonical work pages

[1]

Axel Abels, Diederik Roijers, Tom Lenaerts, Ann Nowé, and Denis Steckelmacher
[2]

In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

Dynamic Weights in Multi-Objective Deep Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhut- dinov (Eds.). PMLR, 11–20. https://proceedings.mlr.press/v97/abels19a.html
[3]

Ardagna, G

D. Ardagna, G. Casale, M. Ciavotta, J. F. Pérez, and W. Wang. 2014. Quality- of-service in cloud computing: modeling techniques and their applications.J Internet Serv Appl5 (2014), 17. https://doi.org/10.1186/s13174-014-0011-3

work page doi:10.1186/s13174-014-0011-3 2014
[4]

M.A. Bragin. 2024. Survey on Lagrangian relaxation for MILP: importance, challenges, historical review, recent advancements, and opportunities.Ann Oper Res333 (2024), 29–45. https://doi.org/10.1007/s10479-023-05499-9

work page doi:10.1007/s10479-023-05499-9 2024
[5]

Valeria Cardellini, Francesco Lo Presti, Matteo Nardelli, and Gabriele Russo Russo
[6]

Kent Wenger

Decentralized self-adaptation for elastic Data Stream Processing.Future Generation Computer Systems87 (2018), 171–185. https://doi.org/10.1016/j.future. 2018.05.025

work page doi:10.1016/j.future 2018
[7]

Riccardo Cavadini, Hamta Sedghani, Federica Filippini, and Danilo Ardagna
[8]

In: European Conference on Com- puter Vision

Runtime Management of Artificial Intelligence Applications Through Hierarchical Reinforcement Learning. InPerformance Evaluation Methodologies and Tools, Marco Gribaudo, Mauro Iacono, and Sahra Sedigh Sarvestani (Eds.). Springer Nature Switzerland, Cham, 252–273. https://doi.org/10.1007/978-3- 032-06818-7_14

work page doi:10.1007/978-3-
[9]

Xi Chen, Ali Ghadirzadeh, Mårten Björkman, and Patric Jensfelt. 2019. Meta- Learning for Multi-objective Reinforcement Learning. In2019 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS). 977–983. https: //doi.org/10.1109/IROS40897.2019.8968092

work page doi:10.1109/iros40897.2019.8968092 2019
[10]

Schahram Dustdar, Victor Casamayor Pujol, and Praveen Kumar Donta. 2023. On Distributed Computing Continuum Systems.IEEE Transactions on Knowledge and Data Engineering35, 4 (2023), 4092–4105. https://doi.org/10.1109/TKDE. 2022.3142856

work page doi:10.1109/tkde 2023
[11]

Federica Filippini, Jonatha Anselmi, Danilo Ardagna, and Bruno Gaujal. 2024. A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems. IEEE Transactions on Cloud Computing12, 01 (01 2024), 53–69. https://doi.org/ 10.1109/TCC.2023.3336540

work page doi:10.1109/tcc.2023.3336540 2024
[12]

Federica Filippini, Marco Lattuada, Michele Ciavotta, Arezoo Jahani, Danilo Ardagna, and Edoardo Amaldi. 2023. A Path Relinking Method for the Joint Online Scheduling and Capacity Allocation of DL Training Workloads in GPU as a Service Systems.IEEE Transactions on Services Computing16, 3 (2023), 1630–1646. https://doi.org/10.1109/TSC.2022.3188440

work page doi:10.1109/tsc.2022.3188440 2023
[13]

Federica Filippini, Hamta Sedghani, and Danilo Ardagna. 2023. SPACE4AI- R: Runtime Management Tool for AI Applications Component Placement and Resource Selection in Computing Continua. InProceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing(Taormina (Messina), Italy,)(UCC ’23). Association for Computing Machinery, New Yo...

work page doi:10.1145/3603166.3632560 2023
[14]

Jordan, Philip S

Dhawal Gupta, Yash Chandak, Scott M. Jordan, Philip S. Thomas, and Bruno Cas- tro da Silva. 2023. Behavior alignment via reward function optimization. In Proceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2297, 33 pages

2023
[15]

A practical guide to multi-objective reinforcement learning and planning,

Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. 2022. A practical guide to multi-objec...

work page doi:10.1007/s10458-022-09552-y 2022
[16]

Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, and Changjie Fan. 2020. Learning to utilize shaping rewards: a new approach of reward shaping. InProceedings of the 34th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, Artic...

2020
[17]

Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C

Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Mor- cos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel
[18]

https: //doi.org/10.1126/science.aau6249

Human-level performance in 3D multiplayer games with population- based reinforcement learning.Science364, 6443 (2019), 859–865. https: //doi.org/10.1126/science.aau6249

work page doi:10.1126/science.aau6249 2019
[19]

Abednego Wamuhindo Kambale, Hamta Sedghani, Federica Filippini, Giacomo Verticale, and Danilo Ardagna. 2026. Tabular Reinforcement Learning Methods for Artificial Intelligence Tasks Offloading in Smart Eye-Wears.ACM Trans. Auton. Adapt. Syst.21, 1, Article 3 (March 2026), 38 pages. https://doi.org/10. 1145/3771092

2026
[20]

X. Li, L. Pan, and S. Liu. 2023. A DRL-based online VM scheduler for cost optimization in cloud brokers. InWorld Wide Web 26. 2399–2425. https://doi. org/10.1007/s11280-023-01145-3

work page doi:10.1007/s11280-023-01145-3 2023
[21]

Lewis, and Shiyin Qin

Bingyao Liu, Satinder Singh, Richard L. Lewis, and Shiyin Qin. 2014. Optimal Rewards for Cooperative Agents.IEEE Transactions on Autonomous Mental Development6, 4 (2014), 286–297. https://doi.org/10.1109/TAMD.2014.2362682

work page doi:10.1109/tamd.2014.2362682 2014
[22]

Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, and Meng Jiang. 2025. Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting. arXiv:2509.11452 [cs.LG] https://arxiv.org/abs/2509.11452

arXiv 2025
[23]

Nima Mahmoudi and Hamzeh Khazaei. 2020. Performance Modeling of Serverless Computing Platforms.IEEE Transactions on Cloud Computing10, 4 (2020), 2834–

2020
[24]

https://doi.org/10.1109/TCC.2020.3033373

work page doi:10.1109/tcc.2020.3033373 2020
[25]

Octavio Pappalardo, Rodrigo Ramele, and Juan Miguel Santos. 2024. Black box meta-learning intrinsic rewards for sparse-reward environments. arXiv:2407.21546 [cs.LG] https://arxiv.org/abs/2407.21546

arXiv 2024
[26]

Emanuele Petriglia, Federica Filippini, Michele Ciavotta, and Marco Savi. 2025. Multi-Agent Reinforcement Learning for Workload Distribution in FaaS-Edge Computing Systems. In2025 IEEE International Parallel and Distributed Pro- cessing Symposium Workshops (IPDPSW). 1128–1131. https://doi.org/10.1109/ IPDPSW66978.2025.00176

arXiv 2025
[27]

Hamta Sedghani, Federica Filippini, and Danilo Ardagna. 2024. SPACE4AI- D: A Design-Time Tool for AI Applications Resource Selection in Computing Continua.IEEE Transactions on Services Computing17, 6 (2024), 4324–4339. https://doi.org/10.1109/TSC.2024.3479935

work page doi:10.1109/tsc.2024.3479935 2024
[28]

Hassam Ullah Sheikh, Shauharda Khadka, Santiago Miret, Somdeb Majumdar, and Mariano Phielipp. 2022. Learning Intrinsic Symbolic Rewards in Reinforcement Learning. In2022 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN55064.2022.9892256

work page doi:10.1109/ijcnn55064.2022.9892256 2022
[29]

Li Shi, Zhemin Zhang, and Thomas Robertazzi. 2017. Energy-Aware Scheduling of Embarrassingly Parallel Jobs and Resource Allocation in Cloud.IEEE Transactions on Parallel and Distributed Systems28, 6 (2017), 1607–1620. https://doi.org/10. 1109/TPDS.2016.2625254

arXiv 2017
[30]

Lewis, Andrew G

Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. 2010. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transactions on Autonomous Mental Development2, 2 (2010), 70–82. https: //doi.org/10.1109/TAMD.2010.2051031

work page doi:10.1109/tamd.2010.2051031 2010
[31]

Jonathan Sorg, Satinder Singh, and Richard L. Lewis. 2010. Reward design via online Gradient ascent. InProceedings of the 24th International Conference on Neural Information Processing Systems - Volume 2(Vancouver, British Columbia, Canada)(NIPS’10). Curran Associates Inc., Red Hook, NY, USA, 2190–2198

2010
[32]

Atefeh Talebian, Alvin Valera, Jyoti Sahni, and Winston K. G. Seah. 2022. Ro- bust Intra-Slice Migration in Fog Computing. In2022 IEEE 47th Conference on Local Computer Networks (LCN). 48–55. https://doi.org/10.1109/LCN53696.2022. 9843470

work page doi:10.1109/lcn53696.2022 2022
[33]

Alessandro Tundo, Federica Filippini, Francesco Regonesi, Michele Ciavotta, and Marco Savi. 2025. Decentralized Edge Workload Forecasting With Gossip Learning.IEEE Transactions on Network and Service Management22, 4 (2025), 3016–3031. https://doi.org/10.1109/TNSM.2025.3570450

work page doi:10.1109/tnsm.2025.3570450 2025
[34]

Pengwei Wang, Yi Li, Chao Fang, Yichen Zhong, and Zhijun Ding. 2025. Opti- mizing Serverless Performance Through Game Theory and Efficient Resource Scheduling.IEEE Trans. Comput.74, 6 (2025), 1990–2002. https://doi.org/10.1109/ TC.2025.3547158

arXiv 2025
[35]

Qian Wang, Rajan Batta, Joyendu Bhadury, and Christopher M. Rump. 2003. Budget constrained location problem with opening and closing of facilities.Com- put. Oper. Res.30, 13 (Nov. 2003), 2047–2069. https://doi.org/10.1016/S0305- 0548(02)00123-5

work page doi:10.1016/s0305- 2003
[36]

Zhongwen Xu, Hado van Hasselt, and David Silver. 2018. Meta-gradient rein- forcement learning. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 2402–2413

2018
[37]

Chen Yang, Guangkai Yang, and Junge Zhang. 2023. Learning Individual Differ- ence Rewards in Multi-Agent Reinforcement Learning. InProceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems(London, United Kingdom)(AAMAS ’23). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2418–2420

2023
[38]

Zanussi, D

L. Zanussi, D. Tessera, L. Massari, and M.C. Calzarossa. 2024. Workflow Sched- uling in the Cloud-Edge Continuum. InLecture Notes on Data Engineering and Communications Technologies (AINA 2024, Vol. 203), Cham Springer (Ed.). 182–190. https://doi.org/10.1007/978-3-031-57931-8_18

work page doi:10.1007/978-3-031-57931-8_18 2024
[39]

Zeyu Zheng, Junhyuk Oh, and Satinder Singh. 2018. On learning intrinsic rewards for policy gradient methods. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 4649–4659

2018

[1] [1]

Axel Abels, Diederik Roijers, Tom Lenaerts, Ann Nowé, and Denis Steckelmacher

[2] [2]

In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

Dynamic Weights in Multi-Objective Deep Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhut- dinov (Eds.). PMLR, 11–20. https://proceedings.mlr.press/v97/abels19a.html

[3] [3]

Ardagna, G

D. Ardagna, G. Casale, M. Ciavotta, J. F. Pérez, and W. Wang. 2014. Quality- of-service in cloud computing: modeling techniques and their applications.J Internet Serv Appl5 (2014), 17. https://doi.org/10.1186/s13174-014-0011-3

work page doi:10.1186/s13174-014-0011-3 2014

[4] [4]

M.A. Bragin. 2024. Survey on Lagrangian relaxation for MILP: importance, challenges, historical review, recent advancements, and opportunities.Ann Oper Res333 (2024), 29–45. https://doi.org/10.1007/s10479-023-05499-9

work page doi:10.1007/s10479-023-05499-9 2024

[5] [5]

Valeria Cardellini, Francesco Lo Presti, Matteo Nardelli, and Gabriele Russo Russo

[6] [6]

Kent Wenger

Decentralized self-adaptation for elastic Data Stream Processing.Future Generation Computer Systems87 (2018), 171–185. https://doi.org/10.1016/j.future. 2018.05.025

work page doi:10.1016/j.future 2018

[7] [7]

Riccardo Cavadini, Hamta Sedghani, Federica Filippini, and Danilo Ardagna

[8] [8]

In: European Conference on Com- puter Vision

Runtime Management of Artificial Intelligence Applications Through Hierarchical Reinforcement Learning. InPerformance Evaluation Methodologies and Tools, Marco Gribaudo, Mauro Iacono, and Sahra Sedigh Sarvestani (Eds.). Springer Nature Switzerland, Cham, 252–273. https://doi.org/10.1007/978-3- 032-06818-7_14

work page doi:10.1007/978-3-

[9] [9]

Xi Chen, Ali Ghadirzadeh, Mårten Björkman, and Patric Jensfelt. 2019. Meta- Learning for Multi-objective Reinforcement Learning. In2019 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS). 977–983. https: //doi.org/10.1109/IROS40897.2019.8968092

work page doi:10.1109/iros40897.2019.8968092 2019

[10] [10]

Schahram Dustdar, Victor Casamayor Pujol, and Praveen Kumar Donta. 2023. On Distributed Computing Continuum Systems.IEEE Transactions on Knowledge and Data Engineering35, 4 (2023), 4092–4105. https://doi.org/10.1109/TKDE. 2022.3142856

work page doi:10.1109/tkde 2023

[11] [11]

Federica Filippini, Jonatha Anselmi, Danilo Ardagna, and Bruno Gaujal. 2024. A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems. IEEE Transactions on Cloud Computing12, 01 (01 2024), 53–69. https://doi.org/ 10.1109/TCC.2023.3336540

work page doi:10.1109/tcc.2023.3336540 2024

[12] [12]

Federica Filippini, Marco Lattuada, Michele Ciavotta, Arezoo Jahani, Danilo Ardagna, and Edoardo Amaldi. 2023. A Path Relinking Method for the Joint Online Scheduling and Capacity Allocation of DL Training Workloads in GPU as a Service Systems.IEEE Transactions on Services Computing16, 3 (2023), 1630–1646. https://doi.org/10.1109/TSC.2022.3188440

work page doi:10.1109/tsc.2022.3188440 2023

[13] [13]

Federica Filippini, Hamta Sedghani, and Danilo Ardagna. 2023. SPACE4AI- R: Runtime Management Tool for AI Applications Component Placement and Resource Selection in Computing Continua. InProceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing(Taormina (Messina), Italy,)(UCC ’23). Association for Computing Machinery, New Yo...

work page doi:10.1145/3603166.3632560 2023

[14] [14]

Jordan, Philip S

Dhawal Gupta, Yash Chandak, Scott M. Jordan, Philip S. Thomas, and Bruno Cas- tro da Silva. 2023. Behavior alignment via reward function optimization. In Proceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2297, 33 pages

2023

[15] [15]

A practical guide to multi-objective reinforcement learning and planning,

Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. 2022. A practical guide to multi-objec...

work page doi:10.1007/s10458-022-09552-y 2022

[16] [16]

Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, and Changjie Fan. 2020. Learning to utilize shaping rewards: a new approach of reward shaping. InProceedings of the 34th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, Artic...

2020

[17] [17]

Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C

Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Mor- cos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel

[18] [18]

https: //doi.org/10.1126/science.aau6249

Human-level performance in 3D multiplayer games with population- based reinforcement learning.Science364, 6443 (2019), 859–865. https: //doi.org/10.1126/science.aau6249

work page doi:10.1126/science.aau6249 2019

[19] [19]

Abednego Wamuhindo Kambale, Hamta Sedghani, Federica Filippini, Giacomo Verticale, and Danilo Ardagna. 2026. Tabular Reinforcement Learning Methods for Artificial Intelligence Tasks Offloading in Smart Eye-Wears.ACM Trans. Auton. Adapt. Syst.21, 1, Article 3 (March 2026), 38 pages. https://doi.org/10. 1145/3771092

2026

[20] [20]

X. Li, L. Pan, and S. Liu. 2023. A DRL-based online VM scheduler for cost optimization in cloud brokers. InWorld Wide Web 26. 2399–2425. https://doi. org/10.1007/s11280-023-01145-3

work page doi:10.1007/s11280-023-01145-3 2023

[21] [21]

Lewis, and Shiyin Qin

Bingyao Liu, Satinder Singh, Richard L. Lewis, and Shiyin Qin. 2014. Optimal Rewards for Cooperative Agents.IEEE Transactions on Autonomous Mental Development6, 4 (2014), 286–297. https://doi.org/10.1109/TAMD.2014.2362682

work page doi:10.1109/tamd.2014.2362682 2014

[22] [22]

Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, and Meng Jiang. 2025. Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting. arXiv:2509.11452 [cs.LG] https://arxiv.org/abs/2509.11452

arXiv 2025

[23] [23]

Nima Mahmoudi and Hamzeh Khazaei. 2020. Performance Modeling of Serverless Computing Platforms.IEEE Transactions on Cloud Computing10, 4 (2020), 2834–

2020

[24] [24]

https://doi.org/10.1109/TCC.2020.3033373

work page doi:10.1109/tcc.2020.3033373 2020

[25] [25]

Octavio Pappalardo, Rodrigo Ramele, and Juan Miguel Santos. 2024. Black box meta-learning intrinsic rewards for sparse-reward environments. arXiv:2407.21546 [cs.LG] https://arxiv.org/abs/2407.21546

arXiv 2024

[26] [26]

Emanuele Petriglia, Federica Filippini, Michele Ciavotta, and Marco Savi. 2025. Multi-Agent Reinforcement Learning for Workload Distribution in FaaS-Edge Computing Systems. In2025 IEEE International Parallel and Distributed Pro- cessing Symposium Workshops (IPDPSW). 1128–1131. https://doi.org/10.1109/ IPDPSW66978.2025.00176

arXiv 2025

[27] [27]

Hamta Sedghani, Federica Filippini, and Danilo Ardagna. 2024. SPACE4AI- D: A Design-Time Tool for AI Applications Resource Selection in Computing Continua.IEEE Transactions on Services Computing17, 6 (2024), 4324–4339. https://doi.org/10.1109/TSC.2024.3479935

work page doi:10.1109/tsc.2024.3479935 2024

[28] [28]

Hassam Ullah Sheikh, Shauharda Khadka, Santiago Miret, Somdeb Majumdar, and Mariano Phielipp. 2022. Learning Intrinsic Symbolic Rewards in Reinforcement Learning. In2022 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN55064.2022.9892256

work page doi:10.1109/ijcnn55064.2022.9892256 2022

[29] [29]

Li Shi, Zhemin Zhang, and Thomas Robertazzi. 2017. Energy-Aware Scheduling of Embarrassingly Parallel Jobs and Resource Allocation in Cloud.IEEE Transactions on Parallel and Distributed Systems28, 6 (2017), 1607–1620. https://doi.org/10. 1109/TPDS.2016.2625254

arXiv 2017

[30] [30]

Lewis, Andrew G

Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. 2010. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transactions on Autonomous Mental Development2, 2 (2010), 70–82. https: //doi.org/10.1109/TAMD.2010.2051031

work page doi:10.1109/tamd.2010.2051031 2010

[31] [31]

Jonathan Sorg, Satinder Singh, and Richard L. Lewis. 2010. Reward design via online Gradient ascent. InProceedings of the 24th International Conference on Neural Information Processing Systems - Volume 2(Vancouver, British Columbia, Canada)(NIPS’10). Curran Associates Inc., Red Hook, NY, USA, 2190–2198

2010

[32] [32]

Atefeh Talebian, Alvin Valera, Jyoti Sahni, and Winston K. G. Seah. 2022. Ro- bust Intra-Slice Migration in Fog Computing. In2022 IEEE 47th Conference on Local Computer Networks (LCN). 48–55. https://doi.org/10.1109/LCN53696.2022. 9843470

work page doi:10.1109/lcn53696.2022 2022

[33] [33]

Alessandro Tundo, Federica Filippini, Francesco Regonesi, Michele Ciavotta, and Marco Savi. 2025. Decentralized Edge Workload Forecasting With Gossip Learning.IEEE Transactions on Network and Service Management22, 4 (2025), 3016–3031. https://doi.org/10.1109/TNSM.2025.3570450

work page doi:10.1109/tnsm.2025.3570450 2025

[34] [34]

Pengwei Wang, Yi Li, Chao Fang, Yichen Zhong, and Zhijun Ding. 2025. Opti- mizing Serverless Performance Through Game Theory and Efficient Resource Scheduling.IEEE Trans. Comput.74, 6 (2025), 1990–2002. https://doi.org/10.1109/ TC.2025.3547158

arXiv 2025

[35] [35]

Qian Wang, Rajan Batta, Joyendu Bhadury, and Christopher M. Rump. 2003. Budget constrained location problem with opening and closing of facilities.Com- put. Oper. Res.30, 13 (Nov. 2003), 2047–2069. https://doi.org/10.1016/S0305- 0548(02)00123-5

work page doi:10.1016/s0305- 2003

[36] [36]

Zhongwen Xu, Hado van Hasselt, and David Silver. 2018. Meta-gradient rein- forcement learning. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 2402–2413

2018

[37] [37]

Chen Yang, Guangkai Yang, and Junge Zhang. 2023. Learning Individual Differ- ence Rewards in Multi-Agent Reinforcement Learning. InProceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems(London, United Kingdom)(AAMAS ’23). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2418–2420

2023

[38] [38]

Zanussi, D

L. Zanussi, D. Tessera, L. Massari, and M.C. Calzarossa. 2024. Workflow Sched- uling in the Cloud-Edge Continuum. InLecture Notes on Data Engineering and Communications Technologies (AINA 2024, Vol. 203), Cham Springer (Ed.). 182–190. https://doi.org/10.1007/978-3-031-57931-8_18

work page doi:10.1007/978-3-031-57931-8_18 2024

[39] [39]

Zeyu Zheng, Junhyuk Oh, and Satinder Singh. 2018. On learning intrinsic rewards for policy gradient methods. InProceedings of the 32nd International Conference on Neural Information Processing Systems(Montréal, Canada)(NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 4649–4659

2018