pith. machine review for the scientific record.

arxiv: 2604.25222 · v1 · submitted 2026-04-28 · 💻 cs.DC

Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions

Pith reviewed 2026-05-07 15:22 UTC · model grok-4.3

classification 💻 cs.DC
keywords microservices · adaptive management · taxonomy · dynamic environments · cloud computing · autoscaling · evaluation methods

The pith

Microservice adaptation systems typically model only part of real production dynamics, and their reported gains track evaluation fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how microservice systems handle changing workloads, network conditions, interference, and failures through adaptive management. It organizes the literature with a taxonomy that examines where control decisions are made, which dynamics are explicitly modeled, what adaptation tactics are applied, and how thoroughly systems are evaluated. Reviewing 84 systems and 13 evaluation studies reveals that most approaches capture dynamics incompletely. This matters for practitioners because partial modeling can leave applications vulnerable to unpredictable conditions that occur in actual deployments. The survey also identifies open problems such as coordinating adaptations across layers and creating reproducible ways to test behavior under realistic change.

Core claim

A taxonomy organized around control locus, modeled dynamics, adaptation strategy, and evaluation evidence shows that most of the 84 examined microservice systems only partially represent production dynamics such as workload variation, request-path changes, interference, and failures. Reported performance improvements also vary with how faithfully the evaluation environments reproduce those dynamics.

What carries the argument

Four-dimensional taxonomy of control locus, modeled dynamics, adaptation strategy, and evaluation evidence, applied to 84 systems and 13 evaluation artifacts.
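
One way to make the four dimensions concrete is to treat each surveyed system as a record. The sketch below is illustrative Python, not the paper's schema: the class name `SystemEntry`, the field names, and every label string are assumptions. Only the four dimension names and the list of dynamics come from the summary above.

```python
from dataclasses import dataclass

# Dynamics named in the summary; the identifiers themselves are illustrative.
DYNAMICS = frozenset(
    {"workload", "request_path", "network", "interference", "failure"}
)


@dataclass(frozen=True)
class SystemEntry:
    """One surveyed system, classified along the four taxonomy dimensions."""

    name: str
    control_locus: str           # hypothetical label, e.g. "platform"
    modeled_dynamics: frozenset  # subset of DYNAMICS
    adaptation_strategy: str     # e.g. "autoscaling", "placement", "routing"
    evaluation_evidence: str     # hypothetical label, e.g. "testbed"

    def coverage(self) -> float:
        """Fraction of the named dynamics that this entry models."""
        return len(self.modeled_dynamics & DYNAMICS) / len(DYNAMICS)

    def is_partial(self) -> bool:
        """True when at least one named dynamic is left unmodeled."""
        return not DYNAMICS <= self.modeled_dynamics


# Invented example entry, not one of the 84 systems.
entry = SystemEntry(
    name="example-autoscaler",
    control_locus="platform",
    modeled_dynamics=frozenset({"workload", "interference"}),
    adaptation_strategy="autoscaling",
    evaluation_evidence="testbed",
)
print(entry.coverage(), entry.is_partial())
```

Under this reading, "partially modeled" simply means `is_partial()` holds; the paper may define partial coverage differently.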

If this is right

  • Cross-layer coordination between autoscaling, placement, routing, and remediation will be required for robust adaptation.
  • Abstractions that connect telemetry directly to control actions can reduce the gap between monitoring and response.
  • Learning-based controllers must add safety constraints to avoid harmful adaptations during exploration.
  • Evaluation frameworks need to become reproducible and dynamic so that claims can be compared fairly.
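
The point about safety constraints can be illustrated with an action filter placed between a learned policy and the orchestrator. This is a minimal sketch under assumed bounds: the function name, the replica limits, and the step limit are hypothetical and do not come from the paper.

```python
def safe_replica_target(
    current: int,
    proposed: int,
    *,
    min_replicas: int = 1,
    max_replicas: int = 20,
    max_step: int = 2,
) -> int:
    """Project a controller's proposed replica count into an allowed range."""
    # Limit how far a single decision may move away from the current state.
    stepped = max(current - max_step, min(proposed, current + max_step))
    # Apply the absolute bounds last so they always hold.
    return max(min_replicas, min(stepped, max_replicas))


# An exploring policy proposes an extreme scale-up and an extreme scale-down.
print(safe_replica_target(5, 50))
print(safe_replica_target(5, 0))
```

The filter leaves the policy free to explore but caps the effect of any single decision, which is one simple form of the constraint the bullet describes.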

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current reported gains may shrink once systems face the full combination of dynamics seen in live cloud environments.
  • The taxonomy could serve as a checklist for designers of future orchestration platforms to avoid common oversights.
  • Adding explicit failure-mode coverage might cut outage frequency more than scaling alone.

Load-bearing premise

The 84 chosen systems and 13 evaluation artifacts stand in for the full range of existing work, and the four taxonomy dimensions together cover the important ways adaptive management can be organized.

What would settle it

A follow-up survey that locates many additional production microservice systems modeling all major dynamics simultaneously and showing performance gains that remain stable across low- and high-fidelity evaluations.

Figures

Figures reproduced from arXiv: 2604.25222 by Maria Rodriguez Read, Ming Chen, Muhammed Tawfiqul Islam, Rajkumar Buyya.

Figure 1: Overview illustration of four common origins of microservice dynamics: demand-side variation, application and configuration …
Figure 2: The proposed taxonomy for dynamics-aware microservice management. D1 records control locus; D2 records modeled …
Original abstract

Microservice-based cloud applications face changing workloads, evolving request paths, variable network conditions, interference, and failures. These dynamics couple autoscaling, placement, routing, isolation, and remediation. The survey examines dynamics-aware adaptive management for microservices. Its taxonomy covers control locus, modeled dynamics, adaptation strategy, and evaluation evidence; objectives and telemetry are cross-cutting. A synthesis of 84 system entries and 13 evaluation artifacts shows that production dynamics are often partially modeled. Reported gains also depend on evaluation fidelity. Key future directions include cross-layer coordination, telemetry-to-control abstractions, safe learning-based control, and reproducible dynamic evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper surveys adaptive management of microservices facing dynamic conditions including workload changes, network variability, interference, and failures. It proposes a taxonomy with four primary dimensions (control locus, modeled dynamics, adaptation strategy, evaluation evidence) plus cross-cutting objectives and telemetry. A synthesis of 84 systems and 13 evaluation artifacts leads to the observations that production dynamics are typically modeled only partially and that reported gains are sensitive to evaluation fidelity. The work identifies future directions such as cross-layer coordination, telemetry-to-control abstractions, safe learning-based control, and reproducible dynamic evaluation.

Significance. If the synthesis holds, the taxonomy supplies a practical organizing framework for a rapidly evolving area of cloud and distributed systems research. The concrete counts (84 systems, 13 artifacts) and the emphasis on partial dynamic modeling plus evaluation fidelity provide actionable insights that can steer both academic work and industrial practice toward more realistic adaptation mechanisms. The call for reproducible dynamic evaluation is a constructive contribution that addresses a known weakness in the broader literature.

major comments (1)
  1. [Methodology / Literature Selection] The methodology section does not specify the literature search strategy, databases, keywords, or inclusion/exclusion criteria used to arrive at the 84 system entries and 13 evaluation artifacts. Without these details the representativeness of the synthesis and the risk of selection bias cannot be assessed, which directly affects the reliability of the central claims about partial modeling of dynamics and dependence on evaluation fidelity.
minor comments (2)
  1. The abstract and introduction could more explicitly state the search period or publication venues covered to help readers gauge temporal coverage.
  2. [Synthesis] A summary table or figure that cross-tabulates the 84 systems against the four taxonomy dimensions would improve readability and allow quicker verification of the reported patterns.
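
The cross-tabulation suggested in the second minor comment could be produced along these lines. The sketch assumes each system is stored as a dictionary of taxonomy labels; the keys, the label values, and the four entries are invented placeholders, not data from the survey.

```python
from collections import Counter


def cross_tabulate(entries, row_key, col_key):
    """Count entries for each (row label, column label) pair."""
    return Counter((e[row_key], e[col_key]) for e in entries)


def render(table):
    """Format the counts as a plain-text grid with sorted labels."""
    rows = sorted({r for r, _ in table})
    cols = sorted({c for _, c in table})
    width = max(len(label) for label in rows + cols) + 2
    header = " " * width + "".join(c.ljust(width) for c in cols)
    body = [
        r.ljust(width) + "".join(str(table[(r, c)]).ljust(width) for c in cols)
        for r in rows
    ]
    return "\n".join([header] + body)


# Placeholder entries for illustration only.
entries = [
    {"name": "system-a", "strategy": "autoscaling", "evidence": "simulation"},
    {"name": "system-b", "strategy": "autoscaling", "evidence": "testbed"},
    {"name": "system-c", "strategy": "placement", "evidence": "testbed"},
    {"name": "system-d", "strategy": "autoscaling", "evidence": "simulation"},
]

table = cross_tabulate(entries, "strategy", "evidence")
print(render(table))
```

The same function covers any pair of dimensions, so one helper would be enough to check each pattern the synthesis reports.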

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work and the constructive comment on methodology transparency. We address the point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methodology / Literature Selection] The methodology section does not specify the literature search strategy, databases, keywords, or inclusion/exclusion criteria used to arrive at the 84 system entries and 13 evaluation artifacts. Without these details the representativeness of the synthesis and the risk of selection bias cannot be assessed, which directly affects the reliability of the central claims about partial modeling of dynamics and dependence on evaluation fidelity.

    Authors: We agree that explicit documentation of the literature selection process is necessary to allow readers to evaluate representativeness and potential selection bias. In the revised manuscript we will add a dedicated subsection to the Methodology section that details:

      • the databases and sources searched (ACM Digital Library, IEEE Xplore, Google Scholar, arXiv, and selected conference proceedings);
      • the keyword combinations and Boolean queries used (e.g., “microservices” AND (“adaptive management” OR “dynamic adaptation” OR “autoscaling” OR “workload variability” OR “network variability”));
      • the time window (2015–2024);
      • inclusion criteria (peer-reviewed papers presenting implemented adaptive systems for microservices that explicitly address at least one form of runtime dynamics);
      • exclusion criteria (non-English works, purely theoretical papers without implementation or evaluation, prior surveys, and papers focused solely on static environments).

    We will also describe the multi-stage screening process (title/abstract screening followed by full-text review) that produced the final counts of 84 systems and 13 evaluation artifacts. These additions will directly support the reliability of the central observations on partial dynamic modeling and evaluation fidelity.

    revision: yes
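
The search string and criteria quoted in the simulated rebuttal can be expressed as a small screening filter. The sketch below is only an approximation: the `Record` fields are hypothetical, plain substring matching stands in for whatever matching each database applies, and the sample records are invented.

```python
from dataclasses import dataclass

# OR-group from the quoted Boolean query.
TOPIC_TERMS = (
    "adaptive management",
    "dynamic adaptation",
    "autoscaling",
    "workload variability",
    "network variability",
)


@dataclass
class Record:
    """Hypothetical bibliographic record with the fields the criteria need."""

    title: str
    abstract: str
    year: int
    language: str = "en"
    peer_reviewed: bool = True
    has_implementation: bool = True
    is_survey: bool = False


def matches_query(text: str) -> bool:
    """"microservices" AND (any topic term), by case-insensitive substring."""
    lowered = text.lower()
    return "microservices" in lowered and any(t in lowered for t in TOPIC_TERMS)


def passes_screening(r: Record) -> bool:
    """Apply the query, the 2015-2024 window, and the stated criteria."""
    return (
        matches_query(f"{r.title} {r.abstract}")
        and 2015 <= r.year <= 2024
        and r.language == "en"
        and r.peer_reviewed
        and r.has_implementation
        and not r.is_survey
    )


# Invented records for illustration.
candidates = [
    Record("Autoscaling microservices under bursty load", "", 2021),
    Record("A survey of autoscaling for microservices", "", 2022, is_survey=True),
    Record("Autoscaling microservices", "", 2013),
]
kept = [r for r in candidates if passes_screening(r)]
print([r.title for r in kept])
```

The criterion about addressing at least one form of runtime dynamics is not modeled here, since it requires reading the paper rather than matching metadata.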

Circularity Check

0 steps flagged

No significant circularity; survey is observational

full rationale

This is a literature survey paper whose core contribution is a taxonomy (control locus, modeled dynamics, adaptation strategy, evaluation evidence) applied to 84 external systems and 13 evaluation artifacts. No equations, derivations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described structure. All reported patterns and future directions are direct summaries of cited external work rather than internally generated quantities that reduce to the paper's own inputs. Self-citations, if present, are not load-bearing for any deductive claim. The synthesis is therefore self-contained against external benchmarks with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper; the central claims rest on classification of existing work rather than new parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5405 in / 997 out tokens · 46682 ms · 2026-05-07T15:22:41.339935+00:00 · methodology

