arxiv: 2604.25222 · v1 · submitted 2026-04-28 · 💻 cs.DC

Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions

Ming Chen, Muhammed Tawfiqul Islam, Maria Rodriguez Read, Rajkumar Buyya

Pith reviewed 2026-05-07 15:22 UTC · model grok-4.3

classification 💻 cs.DC

keywords microservices · adaptive management · taxonomy · dynamic environments · cloud computing · autoscaling · evaluation methods


The pith

Microservice adaptation systems typically model only part of real production dynamics, and their reported gains track evaluation fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how microservice systems handle changing workloads, network conditions, interference, and failures through adaptive management. It organizes the literature with a taxonomy that examines where control decisions are made, which dynamics are explicitly modeled, what adaptation tactics are applied, and how thoroughly systems are evaluated. Reviewing 84 systems and 13 evaluation studies reveals that most approaches capture dynamics incompletely. This matters for practitioners because partial modeling can leave applications vulnerable to unpredictable conditions that occur in actual deployments. The survey also identifies open problems such as coordinating adaptations across layers and creating reproducible ways to test behavior under realistic change.

Core claim

A taxonomy organized around control locus, modeled dynamics, adaptation strategy, and evaluation evidence shows that most of the 84 examined microservice systems only partially represent production dynamics such as workload variation, request-path changes, interference, and failures. Reported performance improvements also vary with how faithfully the evaluation environments reproduce those dynamics.

What carries the argument

Four-dimensional taxonomy of control locus, modeled dynamics, adaptation strategy, and evaluation evidence, applied to 84 systems and 13 evaluation artifacts.
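One way to picture the taxonomy is as a four-field record per surveyed system. The sketch below is illustrative only: the class name `SystemEntry`, the field values, and the two example entries are hypothetical and not taken from the paper. Only the four dimension names and the four dynamics listed under the core claim come from the text above.

```python
from dataclasses import dataclass

# Dynamics named in the core claim; treated here as the full set to cover.
PRODUCTION_DYNAMICS = frozenset(
    {"workload variation", "request-path changes", "interference", "failures"}
)


@dataclass(frozen=True)
class SystemEntry:
    """One surveyed system, classified along the four taxonomy dimensions."""
    name: str
    control_locus: str            # where control decisions are made
    modeled_dynamics: frozenset   # which dynamics are explicitly modeled
    adaptation_strategy: str      # which adaptation tactic is applied
    evaluation_evidence: str      # how the system was evaluated

    def models_partially(self) -> bool:
        """True if at least one production dynamic is left unmodeled."""
        return not PRODUCTION_DYNAMICS <= self.modeled_dynamics


# Hypothetical entries, invented for illustration (not from the survey).
entries = [
    SystemEntry("system-a", "cluster scheduler",
                frozenset({"workload variation"}), "autoscaling", "simulation"),
    SystemEntry("system-b", "service mesh",
                PRODUCTION_DYNAMICS, "routing", "testbed"),
]

partial = [e.name for e in entries if e.models_partially()]
print(partial)  # ['system-a']
```

Under this reading, the survey's headline observation corresponds to `models_partially()` returning true for most of the 84 entries. The paper's actual coding scheme may be finer-grained than a simple set of labels.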

If this is right

Cross-layer coordination between autoscaling, placement, routing, and remediation will be required for robust adaptation.
Abstractions that connect telemetry directly to control actions can reduce the gap between monitoring and response.
Learning-based controllers must add safety constraints to avoid harmful adaptations during exploration.
Evaluation frameworks need to become reproducible and dynamic so that claims can be compared fairly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Current reported gains may shrink once systems face the full combination of dynamics seen in live cloud environments.
The taxonomy could serve as a checklist for designers of future orchestration platforms to avoid common oversights.
Adding explicit failure-mode coverage might cut outage frequency more than scaling alone.

Load-bearing premise

The 84 chosen systems and 13 evaluation artifacts stand in for the full range of existing work, and the four taxonomy dimensions together cover the important ways adaptive management can be organized.

What would settle it

A follow-up survey that locates many additional production microservice systems modeling all major dynamics simultaneously and showing performance gains that remain stable across low- and high-fidelity evaluations.

Figures

Figures reproduced from arXiv: 2604.25222 by Maria Rodriguez Read, Ming Chen, Muhammed Tawfiqul Islam, Rajkumar Buyya.

**Figure 1.** Overview illustration of four common origins of microservice dynamics: demand-side variation, application and configuration ...

**Figure 2.** The proposed taxonomy for dynamics-aware microservice management. D1 records control locus; D2 records modeled ...

Original abstract

Microservice-based cloud applications face changing workloads, evolving request paths, variable network conditions, interference, and failures. These dynamics couple autoscaling, placement, routing, isolation, and remediation. The survey examines dynamics-aware adaptive management for microservices. Its taxonomy covers control locus, modeled dynamics, adaptation strategy, and evaluation evidence; objectives and telemetry are cross-cutting. A synthesis of 84 system entries and 13 evaluation artifacts shows that production dynamics are often partially modeled. Reported gains also depend on evaluation fidelity. Key future directions include cross-layer coordination, telemetry-to-control abstractions, safe learning-based control, and reproducible dynamic evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a straightforward survey that organizes adaptive microservice work into a four-part taxonomy and flags how often real dynamics get only partial treatment.


The main thing to know is that the paper reviews 84 systems and 13 evaluation artifacts to build a taxonomy around control locus, modeled dynamics, adaptation strategy, and evaluation evidence. It concludes that production dynamics are frequently modeled only partially and that claimed gains track closely with how realistic the evaluation setup is. That synthesis is the new piece; prior surveys on microservices or autoscaling do not appear to have pulled these threads together in exactly this way. The future directions on cross-layer coordination, better telemetry abstractions, safe learning control, and reproducible dynamic testing follow logically from the gaps they identify.

The paper does a clean job laying out how workload changes, interference, and failures interact with placement, routing, and remediation. For someone who needs a map of the space, this structure is practical.

The soft spots are the usual ones for a survey. The abstract gives no explicit selection criteria or search protocol for the 84 entries, so representativeness is hard to judge from outside. If the sample leans toward certain venues or omits key industrial systems, the partial-modeling claim could shift. That is not a load-bearing flaw, but it is the part reviewers will press on. No derivations or fitted models are involved, so there is no circularity or hidden parameter issue.

This paper is for distributed-systems researchers and cloud practitioners who want an organized view of adaptive techniques rather than a new algorithm. It is not foundational, but the taxonomy and the evaluation-fidelity observation are concrete enough to be worth citing when framing new work. I would send it to peer review. The organization adds value even if the selection process needs tightening in revision.

Referee Report

1 major / 2 minor

Summary. The paper surveys adaptive management of microservices facing dynamic conditions including workload changes, network variability, interference, and failures. It proposes a taxonomy with four primary dimensions (control locus, modeled dynamics, adaptation strategy, evaluation evidence) plus cross-cutting objectives and telemetry. A synthesis of 84 systems and 13 evaluation artifacts leads to the observations that production dynamics are typically modeled only partially and that reported gains are sensitive to evaluation fidelity. The work identifies future directions such as cross-layer coordination, telemetry-to-control abstractions, safe learning-based control, and reproducible dynamic evaluation.

Significance. If the synthesis holds, the taxonomy supplies a practical organizing framework for a rapidly evolving area of cloud and distributed systems research. The concrete counts (84 systems, 13 artifacts) and the emphasis on partial dynamic modeling plus evaluation fidelity provide actionable insights that can steer both academic work and industrial practice toward more realistic adaptation mechanisms. The call for reproducible dynamic evaluation is a constructive contribution that addresses a known weakness in the broader literature.

major comments (1)

[Methodology / Literature Selection] The methodology section does not specify the literature search strategy, databases, keywords, or inclusion/exclusion criteria used to arrive at the 84 system entries and 13 evaluation artifacts. Without these details the representativeness of the synthesis and the risk of selection bias cannot be assessed, which directly affects the reliability of the central claims about partial modeling of dynamics and dependence on evaluation fidelity.

minor comments (2)

The abstract and introduction could more explicitly state the search period or publication venues covered to help readers gauge temporal coverage.
[Synthesis] A summary table or figure that cross-tabulates the 84 systems against the four taxonomy dimensions would improve readability and allow quicker verification of the reported patterns.
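The cross-tabulation requested in the second comment could be produced mechanically once each system has been coded. A minimal sketch, assuming each system is stored as a dictionary keyed by dimension name; the function name `cross_tabulate` and the three rows are placeholders, not entries from the survey.

```python
from collections import Counter

# Placeholder rows; real rows would come from the survey's coded entries.
systems = [
    {"control_locus": "cluster", "adaptation_strategy": "autoscaling"},
    {"control_locus": "cluster", "adaptation_strategy": "placement"},
    {"control_locus": "mesh", "adaptation_strategy": "routing"},
]


def cross_tabulate(rows, row_key, col_key):
    """Count how many rows fall into each (row_key, col_key) cell."""
    return Counter((r[row_key], r[col_key]) for r in rows)


table = cross_tabulate(systems, "control_locus", "adaptation_strategy")
for (locus, strategy), count in sorted(table.items()):
    print(f"{locus:10} {strategy:12} {count}")
```

Repeating the call for each pair of dimensions would give the full set of tables a reader needs to check the reported patterns.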

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work and the constructive comment on methodology transparency. We address the point below and will revise the manuscript accordingly.


Referee: [Methodology / Literature Selection] The methodology section does not specify the literature search strategy, databases, keywords, or inclusion/exclusion criteria used to arrive at the 84 system entries and 13 evaluation artifacts. Without these details the representativeness of the synthesis and the risk of selection bias cannot be assessed, which directly affects the reliability of the central claims about partial modeling of dynamics and dependence on evaluation fidelity.

Authors: We agree that explicit documentation of the literature selection process is necessary to allow readers to evaluate representativeness and potential selection bias. In the revised manuscript we will add a dedicated subsection to the Methodology section that details: the databases and sources searched (ACM Digital Library, IEEE Xplore, Google Scholar, arXiv, and selected conference proceedings); the keyword combinations and Boolean queries used (e.g., “microservices” AND (“adaptive management” OR “dynamic adaptation” OR “autoscaling” OR “workload variability” OR “network variability”)); the time window (2015–2024); inclusion criteria (peer-reviewed papers presenting implemented adaptive systems for microservices that explicitly address at least one form of runtime dynamics); and exclusion criteria (non-English works, purely theoretical papers without implementation or evaluation, prior surveys, and papers focused solely on static environments). We will also describe the multi-stage screening process (title/abstract screening followed by full-text review) that produced the final counts of 84 systems and 13 evaluation artifacts. These additions will directly support the reliability of the central observations on partial dynamic modeling and evaluation fidelity.

revision: yes
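The example Boolean query in the simulated rebuttal can be read as a simple title/abstract filter. The sketch below is an assumption-laden illustration: the function name `matches_query` is hypothetical, and plain case-insensitive substring matching stands in for whatever tokenization and stemming the real databases apply.

```python
# Terms copied from the example query in the rebuttal.
REQUIRED = "microservices"
ANY_OF = (
    "adaptive management",
    "dynamic adaptation",
    "autoscaling",
    "workload variability",
    "network variability",
)


def matches_query(text: str) -> bool:
    """Apply REQUIRED AND (any term in ANY_OF), case-insensitively."""
    lowered = text.lower()
    return REQUIRED in lowered and any(term in lowered for term in ANY_OF)


# Hypothetical titles, invented for illustration.
print(matches_query("Proactive autoscaling for microservices chains"))  # True
print(matches_query("Autoscaling virtual machines in the cloud"))       # False
```

A literal filter like this misses the singular form "microservice", which is one reason the stated query alone would not be enough to reproduce the 84-entry selection.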

Circularity Check

0 steps flagged

No significant circularity; survey is observational


This is a literature survey paper whose core contribution is a taxonomy (control locus, modeled dynamics, adaptation strategy, evaluation evidence) applied to 84 external systems and 13 evaluation artifacts. No equations, derivations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described structure. All reported patterns and future directions are direct summaries of cited external work rather than internally generated quantities that reduce to the paper's own inputs. Self-citations, if present, are not load-bearing for any deductive claim. The synthesis is therefore self-contained against external benchmarks with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper; the central claims rest on classification of existing work rather than new parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5405 in / 997 out tokens · 46682 ms · 2026-05-07T15:22:41.339935+00:00 · methodology


Reference graph

Works this paper leans on

172 extracted references · 124 canonical work pages

[1]

Al Qassem, Thanos Stouraitis, Ernesto Damiani, and Ibrahim M

Lamees M. Al Qassem, Thanos Stouraitis, Ernesto Damiani, and Ibrahim M. Elfadel. 2024. Containerized Microservices: A Survey of Resource Management Frameworks.IEEE Transactions on Network and Service Management21, 4 (2024), 3775–3796. doi:10.1109/TNSM.2024.3388633 Manuscript submitted to ACM 26 Chen, Islam, Read, and Buyya

work page doi:10.1109/tnsm.2024.3388633 2024
[2]

Alibaba. 2022. Alibaba microservice distributed traces. https://github.com/alibaba/clusterdata/tree/master/cluster-trace-microservices-v2022. Accessed: 2026-04-10

2022
[3]

Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. pFabric: minimal near-optimal datacenter transport. InProceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM(Hong Kong, China)(SIGCOMM ’13). Association for Computing Machinery, New York, NY, USA, 435–446. doi:10.1145/2486001.2486031

work page doi:10.1145/2486001.2486031 2013
[4]

Peter Alvaro, Kolton Andrus, Chris Sanden, Casey Rosenthal, Ali Basiri, and Lorin Hochstein. 2016. Automating Failure Testing Research at Internet Scale. InProceedings of the Seventh ACM Symposium on Cloud Computing(Santa Clara, CA, USA)(SoCC ’16). Association for Computing Machinery, New York, NY, USA, 17–28. doi:10.1145/2987550.2987555

work page doi:10.1145/2987550.2987555 2016
[5]

Apache Software Foundation. [n. d.]. Apache JMeter. https://jmeter.apache.org/. Accessed: 2026-04-10

2026
[6]

Kubernetes Autoscaler. 2026. Cluster Autoscaler. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler. Accessed: 2026-04-10

2026
[7]

Alkiviadis Aznavouridis, Konstantinos Tsakos, and Euripides GM Petrakis. 2022. Micro-service placement policies for cost optimization in Kubernetes. InInternational Conference on Advanced Information Networking and Applications. Springer, Springer, 409–420

2022
[8]

Ataollah Fatahi Baarzi and George Kesidis. 2021. SHOWAR: Right-Sizing And Efficient Scheduling of Microservices. InProceedings of the ACM Symposium on Cloud Computing(Seattle, WA, USA)(SoCC ’21). Association for Computing Machinery, New York, NY, USA, 427–441. doi:10.1145/3472883.3486999

work page doi:10.1145/3472883.3486999 2021
[9]

Yixin Bao, Yanghua Peng, Chuan Wu, and Zongpeng Li. 2018. Online Job Scheduling in Distributed Machine Learning Clusters. InIEEE INFOCOM 2018 - IEEE Conference on Computer Communications(Honolulu, HI, USA). IEEE Press, 495–503. doi:10.1109/INFOCOM.2018.8486422

work page doi:10.1109/infocom.2018.8486422 2018
[10]

Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, and Casey Rosenthal. 2016. Chaos Engineering.IEEE Software33, 3 (2016), 35–41. doi:10.1109/MS.2016.60

work page doi:10.1109/ms.2016.60 2016
[11]

Ranjita Bhagwan, Rahul Kumar, Chandra Sekhar Maddila, and Adithya Abraham Philip. 2018. Orca: Differential Bug Localization in Large-Scale Services. In13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 493–509

2018
[12]

Jing Bi, Libo Zhang, Haitao Yuan, and MengChu Zhou. 2018. Hybrid task prediction based on wavelet decomposition and ARIMA model in cloud data center. In2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC). 1–6. doi:10.1109/ICNSC.2018.8361342

work page doi:10.1109/icnsc.2018.8361342 2018
[13]

Zhengda Bian, Shenggui Li, Wei Wang, and Yang You. 2021. Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(St. Louis, Missouri)(SC ’21). ACM, New York, NY, USA, Article 100, 15 pages. doi:10.1145...

work page doi:10.1145/3458817.3480859 2021
[14]

Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes. 2016. Borg, Omega, and Kubernetes.Commun. ACM59, 5 (April 2016), 50–57. doi:10.1145/2890784

work page doi:10.1145/2890784 2016
[15]

Calheiros, Enayat Masoumi, Rajiv Ranjan, and Rajkumar Buyya

Rodrigo N. Calheiros, Enayat Masoumi, Rajiv Ranjan, and Rajkumar Buyya. 2015. Workload Prediction Using ARIMA Model and Its Impact on Cloud Applications’ QoS.IEEE Transactions on Cloud Computing3, 4 (2015), 449–458. doi:10.1109/TCC.2014.2350475

work page doi:10.1109/tcc.2014.2350475 2015
[16]

Calheiros, Rajiv Ranjan, Anton Beloglazov, César A

Rodrigo N. Calheiros, Rajiv Ranjan, Anton Beloglazov, César A. F. De Rose, and Rajkumar Buyya. 2011. CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms.Softw. Pract. Exper.41, 1 (Jan. 2011), 23–50. doi:10.1002/spe.995

work page doi:10.1002/spe.995 2011
[17]

Lianjie Cao and Puneet Sharma. 2021. Co-locating Containerized Workloads Using Service Mesh Telemetry. InProceedings of the 17th ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT). 168–181. doi:10.1145/3485983.3494867

work page doi:10.1145/3485983.3494867 2021
[18]

Carmen Carrión. 2022. Kubernetes Scheduling: Taxonomy, Ongoing Issues and Challenges.ACM Comput. Surv.55, 7, Article 138 (Dec. 2022), 37 pages. doi:10.1145/3539606

work page doi:10.1145/3539606 2022
[19]

Donahoo, and Michal Trnka

Tomas Cerny, Michael J. Donahoo, and Michal Trnka. 2018. Contextual Understanding of Microservice Architecture: Current and Future Directions. SIGAPP Applied Computing Review17, 4 (2018), 29–45. doi:10.1145/3183628.3183631

work page doi:10.1145/3183628.3183631 2018
[20]

Liao Chen, Shutian Luo, Chenyu Lin, Zizhao Mo, Huanle Xu, Kejiang Ye, and Chengzhong Xu. 2024. Derm: SLA-aware Resource Management for Highly Dynamic Microservices. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 424–436. doi:10.1109/ ISCA59077.2024.00039

work page arXiv 2024
[21]

Ming Chen, Muhammed Tawfiqul Islam, Maria Rodriguez Read, and Rajkumar Buyya. 2025. iDynamics: A Configurable Emulation Framework for Evaluating Microservice Scheduling Policies under Controllable Cloud-Edge Dynamics. arXiv:2503.16029 [cs.DC] https://arxiv.org/abs/2503.16029

work page arXiv 2025
[22]

Ming Chen, Muhammed Tawfiqul Islam, Maria Rodriguez Read, and Rajkumar Buyya. 2026. TraDE: Network and Traffic-Aware Adaptive Scheduling for Microservices Under Dynamics .IEEE Transactions on Parallel & Distributed Systems37, 01 (Jan. 2026), 76–89. doi:10.1109/TPDS.2025.3626424

work page doi:10.1109/tpds.2025.3626424 2026
[23]

Ming Chen, Maria Rodriguez Read, Patricia Arroba, and Rajkumar Buyya. 2023. EN-Beats: A Novel Ensemble Learning-Based Method for Multiple Resource Predictions in Cloud. In2023 IEEE 16th International Conference on Cloud Computing (CLOUD). 144–154. doi:10.1109/CLOUD60044.2023. 00025

work page doi:10.1109/cloud60044.2023 2023
[24]

Martínez

Shuang Chen, Christina Delimitrou, and José F. Martínez. 2019. PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems(Providence, RI, USA)(ASPLOS ’19). ACM, New York, NY, USA, 107–120. doi:10.1145/32978...

work page doi:10.1145/3297858.3304005 2019
[25]

Martínez

Shuang Chen, Yi Jiang, Christina Delimitrou, and José F. Martínez. 2022. PIMCloud: QoS-Aware Resource Management of Latency-Critical Applications in Clouds with Processing-in-Memory. In2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 1086–1099. doi:10.1109/HPCA53966.2022.00083 Manuscript submitted to ACM Adaptive Microse...

work page doi:10.1109/hpca53966.2022.00083 2022
[26]

Yitian Chen, Yanfei Kang, Yixiong Chen, and Zizhuo Wang. 2020. Probabilistic forecasting with temporal convolutional neural network.Neurocom- puting399 (2020), 491–501. doi:10.1016/j.neucom.2020.03.011

work page doi:10.1016/j.neucom.2020.03.011 2020
[27]

Arora, Yu Deng, Saurabh Jha, and Tianyin Xu

Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan R. Arora, Yu Deng, Saurabh Jha, and Tianyin Xu. 2025. STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS’25). https://openreview.net/forum?id=fYW1PKawwJ

2025
[28]

Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, and Saravan Rajmohan. 2025. AIOpsLab: A Holistic Framework for Evaluating AI Agents for Enabling Autonomous Cloud. InMLSys ’25. https://www.microsoft. com/en-us/research/publication/aiopslab-a-holistic-framework-for-evaluating-ai-agents-for...

2025
[29]

Dazhao Cheng, Xiaobo Zhou, Zhijun Ding, Yu Wang, and Mike Ji. 2019. Heterogeneity Aware Workload Management in Distributed Sustainable Datacenters.IEEE Transactions on Parallel and Distributed Systems (TPDS)30, 2 (Feb 2019), 375–387. doi:10.1109/TPDS.2018.2865927

work page doi:10.1109/tpds.2018.2865927 2019
[30]

Byungkwon Choi, Jinwoo Park, Chunghan Lee, and Dongsu Han. 2021. PHPA: A Proactive Autoscaling Framework for Microservice Chain. In Proceedings of the 5th Asia-Pacific Workshop on Networking. ACM, 65–71. doi:10.1145/3469393.3469401

work page doi:10.1145/3469393.3469401 2021
[31]

Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. InProceedings of the 26th Symposium on Operating Systems Principles(Shanghai, China)(SOSP ’17). ACM, New York, NY, USA, 153–167. doi:10...

work page doi:10.1145/3132747.3132772 2017
[32]

Mario De Jesus, Perfect Sylvester, William Clifford, Aaron Perez, and Palden Lama. 2025. LLM-Based Multi-Agent Framework for Troubleshooting Distributed Systems. InProceedings of the 2025 IEEE Cloud Summit. IEEE, 110–115. doi:10.1109/CLOUD-SUMMIT64795.2025.00024

work page doi:10.1109/cloud-summit64795.2025.00024 2025
[33]

Dean, Hiep Nguyen, Xiaohui Gu, Hui Zhang, Junghwan Rhee, Nipun Arora, and Geoff Jiang

Daniel J. Dean, Hiep Nguyen, Xiaohui Gu, Hui Zhang, Junghwan Rhee, Nipun Arora, and Geoff Jiang. 2014. PerfScope: Practical Online Server Performance Bug Inference in Production Cloud Computing Infrastructures. InProceedings of the ACM Symposium on Cloud Computing(Seattle, WA, USA)(SOCC ’14). ACM, New York, NY, USA, 1–13. doi:10.1145/2670979.2670987

work page doi:10.1145/2670979.2670987 2014
[34]

Jeffrey Dean and Luiz André Barroso. 2013. The Tail at Scale.Commun. ACM56, 2 (Feb. 2013), 74–80. doi:10.1145/2408776.2408794

work page doi:10.1145/2408776.2408794 2013
[35]

Christina Delimitrou. 2013. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. InProceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems(Houston, Texas, USA)(ASPLOS ’13). ACM, New York, NY, USA, 77–88. doi:10.1145/2451116.2451125

work page doi:10.1145/2451116.2451125 2013
[36]

Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: Resource-Efficient and QoS-Aware Cluster Management. InProceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems(Salt Lake City, Utah, USA)(ASPLOS ’14). ACM, New York, NY, USA, 127–144. doi:10.1145/2541940.2541941

work page doi:10.1145/2541940.2541941 2014
[37]

2023.µBench: An Open-Source Factory of Benchmark Microservice Applications.IEEE Transactions on Parallel and Distributed Systems34, 3 (2023), 968–980

Andrea Detti, Ludovico Funari, and Luca Petrucci. 2023.µBench: An Open-Source Factory of Benchmark Microservice Applications.IEEE Transactions on Parallel and Distributed Systems34, 3 (2023), 968–980. doi:10.1109/TPDS.2023.3236447

work page doi:10.1109/tpds.2023.3236447 2023
[38]

Zhijun Ding, Song Wang, and Changjun Jiang. 2023. Kubernetes-Oriented Microservice Placement With Dynamic Resource Allocation .IEEE Transactions on Cloud Computing11, 02 (April 2023), 1777–1793. doi:10.1109/TCC.2022.3161900

work page doi:10.1109/tcc.2022.3161900 2023
[39]

Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. 2017. Microservices: Yesterday, Today, and Tomorrow. InPresent and Ulterior Software Engineering, Manuel Mazzara and Bertrand Meyer (Eds.). Springer, 195–216. doi:10.1007/978-3-319-67425-4_12

work page doi:10.1007/978-3-319-67425-4_12 2017
[40]

Fanrong Du, Jiuchen Shi, Quan Chen, Pu Pang, Li Li, and Minyi Guo. 2025. Generating Microservice Graphs with Production Characteristics for Efficient Resource Scaling. InProceedings of the 39th ACM International Conference on Supercomputing (ICS ’25). Association for Computing Machinery, New York, NY, USA, 895–910. doi:10.1145/3721145.3725761

work page doi:10.1145/3721145.3725761 2025
[41]

Raphael Eidenbenz, Yvonne-Anne Pignolet, and Alain Ryser. 2020. Latency-Aware Industrial Fog Application Orchestration with Kubernetes. In 2020 Fifth International Conference on Fog and Mobile Edge Computing (FMEC). 164–171. doi:10.1109/FMEC49853.2020.9144934

work page doi:10.1109/fmec49853.2020.9144934 2020
[42]

Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, Eric Mann-Hielscher, Ardas Cilingiroglu, Bin Cheyney, Wentao Shang, and Jinnah Dylan Hosein

Danielle E. Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, Eric Mann-Hielscher, Ardas Cilingiroglu, Bin Cheyney, Wentao Shang, and Jinnah Dylan Hosein. 2016. Maglev: a fast and reliable software network load balancer. InProceedings of the 13th Usenix Conference on Networked Systems Design and Implementation(Santa Clara, CA)(NSDI’16). USE...

2016
[43]

Simon Eismann, Joel Scheuner, Erwin Van Eyk, Maximilian Schwinger, Johannes Grohmann, Nikolas Herbst, Cristina L Abad, and Alexandru Iosup
[44]

A review of serverless use cases and their characteristics.arXiv preprint arXiv:2008.11110(2020)

work page arXiv 2008
[45]

Envoy: An Open Source Edge and Service Proxy, Designed for Cloud Native Apps. 2026. https://www.envoyproxy.io/. Accessed: 2026-04-10

2026
[46]

Katz, and Scott Shenker

Rodrigo Fonseca, George Porter, Randy H. Katz, and Scott Shenker. 2007. X-Trace: A Pervasive Network Tracing Framework. In4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07). USENIX Association, Cambridge, MA. https://www.usenix.org/conference/ nsdi-07/x-trace-pervasive-network-tracing-framework

2007
[47]

Kaihua Fu, Wei Zhang, Quan Chen, Deze Zeng, and Minyi Guo. 2021. Adaptive resource efficient microservice deployment in cloud-edge continuum. IEEE Transactions on Parallel and Distributed Systems33, 8 (2021), 1825–1840

2021
[48]

Kaihua Fu, Wei Zhang, Quan Chen, Deze Zeng, Xin Peng, Wenli Zheng, and Minyi Guo. 2021. QoS-Aware and Resource Efficient Microservice Deployment in Cloud-Edge Continuum. InProceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 932–941. doi:10.1109/IPDPS49936.2021.00102

work page doi:10.1109/ipdps49936.2021.00102 2021
[49]

Nan Fu, Guang Cheng, Yue Teng, Guangye Dai, Shui Yu, and Zihan Chen. 2025. Intelligent Root Cause Localization in MicroService Systems: A Survey and New Perspectives.ACM Comput. Surv.57, 12, Article 325 (July 2025), 37 pages. doi:10.1145/3736755 Manuscript submitted to ACM 28 Chen, Islam, Read, and Buyya

work page doi:10.1145/3736755 2025
[50]

Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. 2021. Sage: practical and scalable ML-driven performance debugging in microservices. InProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, USA)(ASPLOS ’21). ACM, New York, NY, USA, 135–151. doi:10.1145/3...

work page doi:10.1145/3445814.3446700 2021
[51]

Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-S...

work page doi:10.1145/3297858.3304013 2019
[52]

Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. InProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems(Providence, RI, US...

work page doi:10.1145/3297858.3304004 2019
[53]

Mohammad Goudarzi, Marimuthu Palaniswami, and Rajkumar Buyya. 2022. Scheduling IoT Applications in Edge and Fog Computing Environments: A Taxonomy and Future Directions.ACM Comput. Surv.55, 7, Article 152 (Dec. 2022), 41 pages. doi:10.1145/3544836

work page doi:10.1145/3544836 2022
[54]

Paulo Gouveia, João Neves, Carlos Segarra, Luca Liechti, Shady Issa, Valerio Schiavoni, and Miguel Matos. 2020. Kollaps: Decentralized and Dynamic Topology Emulation. InProceedings of the Fifteenth European Conference on Computer Systems(Heraklion, Greece)(EuroSys ’20). Association for Computing Machinery, New York, NY, USA, Article 23, 16 pages. doi:10.1...

work page doi:10.1145/3342195.3387540 2020
[55]

Lin Gu, Deze Zeng, Jie Hu, Hai Jin, Song Guo, and Albert Y. Zomaya. 2021. Exploring Layered Container Structure for Cost Efficient Microservice Deployment. InProceedings of the IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 1–9. doi:10.1109/INFOCOM42981.2021. 9488918

work page doi:10.1109/infocom42981.2021 2021
[56]

Shilin He, Botao Feng, Liqun Li, Xu Zhang, Yu Kang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. 2023. STEAM: Observability- Preserving Trace Sampling. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(San Francisco, CA, USA)(ESEC/FSE 2023). Association for Computing ...

work page doi:10.1145/3611643.3613881 2023
[57]

Joseph, Randy Katz, Scott Shenker, and Ion Stoica

Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: a platform for fine-grained resource sharing in the data center. InProceedings of the 8th USENIX Conference on Networked Systems Design and Implementation(Boston, MA)(NSDI’11). USENIX Association, USA, 295–308

2011
[58]

Vijaykumar

Chi-Yao Hong, Matthew Caesar, and P. Brighten Godfrey. 2012. Finishing Flows Quickly with Preemptive Scheduling. InProceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication(Helsinki, Finland)(SIGCOMM ’12). ACM, New York, NY, USA, 127–138. doi:10.1145/2342356.2342389

work page doi:10.1145/2342356.2342389 2012
[59]

Yi Hu, Haonan Ding, Haoxuan Chen, Jianwen He, Menglan Hu, Chao Cai, and Kai Peng. 2025. Collaborative Orchestration with Probabilistic Routing for Dynamic Service Mesh in Clouds. InIEEE INFOCOM 2025-IEEE Conference on Computer Communications. IEEE, 1–10

2025
[60]

Yi Hu, Hao Wang, Liangyuan Wang, Menglan Hu, Kai Peng, and Bharadwaj Veeravalli. 2023. Joint Deployment and Request Routing for Microservice Call Graphs in Data Centers.IEEE Transactions on Parallel and Distributed Systems34, 11 (2023), 2994–3011. doi:10.1109/TPDS.2023.3311767

[61]

Lexiang Huang and Timothy Zhu. 2021. tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces. In Proceedings of the ACM Symposium on Cloud Computing (Seattle, WA, USA) (SoCC ’21). Association for Computing Machinery, New York, NY, USA, 76–91. doi:10.1145/3472883.3486994

[62]

Darby Huye, Yuri Shkuro, and Raja R. Sambasivan. 2023. Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 419–432

[63]

Calin Iorgulescu, Reza Azimi, Youngjin Kwon, Sameh Elnikety, Manoj Syamala, Vivek Narasayya, Herodotos Herodotou, Paulo Tomita, Alex Chen, Jack Zhang, and Junhua Wang. 2018. PerfIso: Performance Isolation for Commercial Latency-Sensitive Services. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston, MA, 519–532

[64]

Istio: An open source service mesh. 2026. https://istio.io/. Accessed: 2026-04-10

[65]

Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazières, Balaji Prabhakar, Albert Greenberg, and Changhoon Kim. 2013. EyeQ: Practical Network Performance Isolation at the Edge. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX Association, Lombard, IL, 297–311

[66]

Saurabh Jha, Rohan Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, Bekir O Turkkan, Gerard Vanloo, Michael Nidd, Ting Dai, Oishik Chatterjee, Pranjal Gupta, Suranjana Samanta, Pooja Aggarwal, Rong Lee, Jae-woo...

[67]

Ram Srivatsa Kannan, Lavanya Subramanian, Ashwin Raju, Jeongseob Ahn, Jason Mars, and Lingjia Tang. 2019. GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks. In Proceedings of the Fourteenth EuroSys Conference 2019 (Dresden, Germany) (EuroSys ’19). ACM, New York, NY, USA, Article 34, 16 pages. doi:10.1145/3302424.3303958

[68]

Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas, Kishore Chaliparambil, Giovanni Matteo Fumarola, Solom Heddaya, Raghu Ramakrishnan, and Sarvesh Sakalanaga. 2015. Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 485–497

[69]

Karpenter. 2026. Karpenter. https://karpenter.sh/. Accessed: 2026-04-10

[70]

Harshad Kasture, Davide B. Bartolini, Nathan Beckmann, and Daniel Sanchez. 2015. Rubik: Fast analytical power management for latency-critical systems. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 598–610. doi:10.1145/2830772.2830797

[71]

KEDA. 2026. KEDA: Kubernetes Event-driven Autoscaling. https://keda.sh/. Accessed: 2026-04-10

[72]

Jeffrey O. Kephart and David M. Chess. 2003. The Vision of Autonomic Computing. Computer 36, 1 (2003), 41–50. doi:10.1109/MC.2003.1160055

[73]

Tahseen Khan, Wenhong Tian, Guangyao Zhou, Shashikant Ilager, Mingming Gong, and Rajkumar Buyya. 2022. Machine learning (ML)-centric resource management in cloud computing: A review and future directions. Journal of Network and Computer Applications 204 (2022), 103405. doi:10.1016/j.jnca.2022.103405

[74]

In Kee Kim, Wei Wang, Yanjun Qi, and Marty Humphrey. 2022. Forecasting Cloud Application Workloads With CloudInsight for Predictive Resource Management. IEEE Transactions on Cloud Computing 10, 3 (2022), 1848–1863. doi:10.1109/TCC.2020.2998017

[75]

Knative. 2026. Knative. https://knative.dev/. Accessed: 2026-04-10

[76]

Kubernetes Authors. 2026. Kubernetes Horizontal Pod Autoscaling. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/. Accessed: 2026-04-10

[77]

Kubernetes Authors. 2026. Kubernetes Vertical Pod Autoscaling. https://kubernetes.io/docs/concepts/workloads/autoscaling/vertical-pod-autoscale/. Accessed: 2026-04-10

[78]

Jitendra Kumar, Rimsha Goomer, and Ashutosh Kumar Singh. 2018. Long Short Term Memory Recurrent Neural Network (LSTM-RNN) Based Workload Forecasting Model For Cloud Datacenters. Procedia Computer Science 125 (2018), 676–682. doi:10.1016/j.procs.2017.12.087 The 6th International Conference on Smart Computing and Communications

[79]

Jitendra Kumar and Ashutosh Kumar Singh. 2018. Workload prediction in cloud using artificial neural network and adaptive differential evolution. Future Generation Computer Systems 81 (2018), 41–52. doi:10.1016/j.future.2017.10.047

[80]

Grafana Labs. 2026. Grafana. https://grafana.com/oss/grafana/. Accessed: 2026-04-10

