UModel: An Agent-Ready Observability Data Modeling Method at Scale
Pith reviewed 2026-06-28 05:22 UTC · model grok-4.3
The pith
UModel turns fragmented observability data into objects linked by semantic graphs so agents can trace root causes across system topologies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs; a companion pipeline-based query interface called U-SPL then lets agents autonomously explore topologies and correlate multimodal data, producing an 8 percent increase in root cause localization precision on the AIOps 2025 Challenge dataset.
What carries the argument
The object-centric ontological layer that converts telemetry and entities into standardized objects joined by semantic graphs, together with the U-SPL query interface that supports autonomous topology exploration.
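To make the object-centric idea concrete, here is a minimal sketch of an object layer joined by a semantic graph. It is not UModel's actual schema: the class names (`Obj`, `SemanticGraph`), the `kind` values, the relation labels, and the sample objects are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Obj:
    """One node in the object layer (all field names are hypothetical)."""
    obj_id: str
    kind: str  # "entity", "telemetry", or "knowledge"
    name: str


@dataclass
class SemanticGraph:
    objects: dict = field(default_factory=dict)  # obj_id -> Obj
    edges: list = field(default_factory=list)    # (src_id, relation, dst_id)

    def add(self, obj: Obj) -> None:
        self.objects[obj.obj_id] = obj

    def link(self, src_id: str, relation: str, dst_id: str) -> None:
        self.edges.append((src_id, relation, dst_id))

    def linked(self, obj_id: str, kind: str) -> list:
        """Objects of the given kind one edge away from obj_id, in either direction."""
        found = []
        for src, _relation, dst in self.edges:
            if src == obj_id:
                other = dst
            elif dst == obj_id:
                other = src
            else:
                continue
            if self.objects[other].kind == kind:
                found.append(self.objects[other])
        return found


# Hypothetical sample data: one pod, its node, two telemetry sets, one runbook.
graph = SemanticGraph()
for obj in (
    Obj("pod-1", "entity", "checkout pod"),
    Obj("node-1", "entity", "worker node"),
    Obj("m-cpu", "telemetry", "pod CPU metrics"),
    Obj("l-app", "telemetry", "pod logs"),
    Obj("kb-oom", "knowledge", "OOM runbook"),
):
    graph.add(obj)

graph.link("pod-1", "runs_on", "node-1")
graph.link("m-cpu", "measures", "pod-1")
graph.link("l-app", "emitted_by", "pod-1")
graph.link("kb-oom", "applies_to", "pod-1")

telemetry_names = [o.name for o in graph.linked("pod-1", "telemetry")]
print(telemetry_names)
```

The point of the sketch is that an agent asks one question of one structure ("which telemetry is attached to this entity?") instead of knowing where each metric or log store lives.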
If this is right
- Agents gain the ability to traverse system topologies and correlate data without custom integration code for each data source.
- Downstream RCA accuracy rises when the same agents operate on data organized under the unified object model.
- The modeling supports production workloads at millions of operations per second with sub-second query responses.
- Heterogeneous observability sources become interchangeable once expressed as objects in the semantic graph.
Where Pith is reading between the lines
- The same object-graph approach could be applied to other agent tasks that require correlating logs, metrics, and traces across distributed systems.
- Standardized semantic graphs might reduce the engineering cost of onboarding new observability tools or migrating between vendors.
- If the modeling proves portable, it could serve as a shared substrate for multiple independent RCA agents rather than each maintaining its own data mappings.
Load-bearing premise
The measured 8 percent precision gain arises from the shift to object-centric modeling and semantic graphs rather than from differences in agents, experimental setup, or other unstated factors.
What would settle it
Re-executing the root cause localization task on the identical AIOps 2025 Challenge dataset using the original non-UModel data organization while holding the agents and all other experimental conditions fixed, then observing whether the precision remains unchanged.
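One way such a controlled re-run could be scored is a paired comparison over the same fault cases. The sketch below assumes that precision means the fraction of cases whose root cause was localized correctly and uses an exact McNemar test on the discordant pairs; the function names and the ten outcome values are invented for illustration and are not results from the paper.

```python
from math import comb


def precision(hits):
    """Fraction of cases whose predicted root cause was correct (assumed definition)."""
    return sum(hits) / len(hits)


def mcnemar_exact(baseline_hits, treated_hits):
    """Two-sided exact McNemar p-value from paired per-case outcomes."""
    pairs = list(zip(baseline_hits, treated_hits))
    only_baseline = sum(1 for b, t in pairs if b and not t)
    only_treated = sum(1 for b, t in pairs if t and not b)
    n = only_baseline + only_treated
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_baseline, only_treated)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-case outcomes (1 = correct) for the same agent and prompts,
# run once on the original data layout and once on the re-modeled data.
baseline = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
remodeled = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]

delta = precision(remodeled) - precision(baseline)
p_value = mcnemar_exact(baseline, remodeled)
print(f"precision delta = {delta:.2f}, exact McNemar p = {p_value:.3f}")
```

A paired test fits here because both runs see identical cases; only the cases where the two runs disagree carry information about the effect of the data organization.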
Original abstract
When networked system failures occur, automatically performing Root Cause Analysis (RCA) using observability data is critical for ensuring networked system reliability. Recently, LLM-based agents have shown promise for automating this diagnosis process through advanced reasoning and autonomous exploration. However, existing observability frameworks remain archaic, characterized by fragmented data silos, incompatible schemas, and insufficient semantic metadata, preventing agents from establishing the complex relationships required for effective RCA. To address these challenges, we present UModel, a unified ontological framework that shifts observability from data-centric to object-centric modeling. UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs. In addition, we introduce U-SPL, a pipeline-based query interface that enables agents to autonomously explore system topologies and correlate multimodal data. By re-modeling the "AIOps 2025 Challenge" dataset using UModel, the precision of root cause localization improved by 8%, demonstrating that enhanced data organization can significantly increase the accuracy of downstream tasks. UModel provides a scalable modeling framework that, in its deployment at Alibaba Cloud for more than one year, has served tens of thousands of users, sustained millions of operations per second, and delivered sub-second query latency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UModel, a unified object-centric ontological framework for modeling heterogeneous observability data (telemetry, entities, expert knowledge) as interconnected semantic graphs, along with the U-SPL pipeline-based query interface. The central empirical claim is that re-modeling the AIOps 2025 Challenge dataset with UModel yields an 8% improvement in root cause localization precision for LLM-based RCA agents; the work also reports production deployment at Alibaba Cloud serving tens of thousands of users at millions of operations per second with sub-second latency.
Significance. If the 8% precision gain can be isolated to the object-centric modeling and semantic graphs with fixed agent behavior, the approach could meaningfully improve the effectiveness of LLM agents on RCA tasks by addressing data fragmentation and lack of semantics in existing observability systems. The reported production metrics at Alibaba provide evidence of scalability, which would strengthen the practical contribution if the experimental attribution holds.
major comments (2)
- [Abstract] The claim that re-modeling the AIOps 2025 Challenge dataset with UModel improved root cause localization precision by 8% is reported without a baseline RCA method, a statement that the LLM agent and prompts were held fixed, raw before/after scores, a statistical test, or a description of how the re-modeling altered input features or topology. This prevents attribution of the delta to UModel's ontological modeling rather than to confounding changes in data ingestion or evaluation protocol.
- [Evaluation] The experimental evaluation (referenced in the abstract) provides no ablation studies, controls, or exclusion criteria that would isolate the contribution of the semantic graphs and object-centric modeling from other factors such as query interface changes or dataset preprocessing.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater clarity on experimental attribution and controls. We address each major comment below and will revise the manuscript to strengthen the presentation of results.
- Referee: [Abstract] The claim that re-modeling the AIOps 2025 Challenge dataset with UModel improved root cause localization precision by 8% is reported without a baseline RCA method, a statement that the LLM agent and prompts were held fixed, raw before/after scores, a statistical test, or a description of how the re-modeling altered input features or topology. This prevents attribution of the delta to UModel's ontological modeling rather than to confounding changes in data ingestion or evaluation protocol.
  Authors: We agree the abstract is insufficiently detailed for proper attribution. In the revision we will expand it to name the baseline RCA method, state explicitly that the LLM agent and prompts remained fixed, report the raw before/after precision scores, include the statistical test performed, and describe the precise changes introduced by the object-centric re-modeling (addition of semantic graphs and standardized entity objects) while confirming that data ingestion and evaluation protocols were unchanged.
  Revision: yes
- Referee: [Evaluation] The experimental evaluation (referenced in the abstract) provides no ablation studies, controls, or exclusion criteria that would isolate the contribution of the semantic graphs and object-centric modeling from other factors such as query interface changes or dataset preprocessing.
  Authors: We acknowledge that the current evaluation section does not contain the requested ablations or controls. In the revised manuscript we will add ablation experiments that isolate the effect of the semantic graphs and object-centric modeling while holding the U-SPL query interface and preprocessing steps fixed, include explicit controls for dataset changes, and state the exclusion criteria applied. These additions will allow readers to attribute the observed 8% gain more directly to the ontological modeling.
  Revision: yes
Circularity Check
No circularity: empirical claim on external dataset with no fitted parameters or self-referential derivations.
The paper's central claim is an 8% precision improvement on the external AIOps 2025 Challenge dataset after applying UModel. No equations, parameter fits, or predictions are described that reduce to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The modeling method and query interface are presented as independent contributions, with the result framed as an external evaluation rather than a derived quantity equivalent to its inputs. This is the common case of a self-contained empirical report.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: heterogeneous telemetry, entities, and expert knowledge can be standardized as objects interconnected via semantic graphs to support more effective RCA by agents.
invented entities (2)
- UModel: no independent evidence
- U-SPL: no independent evidence
Forward citations
Cited by 1 Pith paper
- A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis: introduces AIOps2025 and RCA100 datasets for evaluating LLM agents on microservice failure diagnosis via localization, identification, and reasoning-grounded-in-evidence dimensions.
U-SPL query examples (table excerpt)
- Q1
  Manual steps: ... Query {pod2} {metric} data; Query {pod3} {metric} data; Aggregate and return.
  U-SPL: .entity_set with(domain='aiops', name='aiops.pod', ids=[id1, id2,...]) | entity-call get_metric('aiops', 'aiops.metric.pod', '{metric}', 'range', '', aggregate=true)
  Effect: Aggregates multi-pod metric values in one query and returns unified results.
- Q2 (Knowledge)
  Manual steps: 1) Get ownership (node/service); 2) Collect depth from trace/logs; 3) Merge into a unified dep graph/table; 4) Expand to 4 hops (BFS / multi-join); 5) Summarize and filter related entities.
  U-SPL: .topo | graph-call cypher(`MATCH (s:`aiops@aiops.pod` {__entity_id__: 'id'})-[e]-(d)-[f]-(g)-[h]-(j)-[k]-(l) RETURN s, d, g, j, l`)
  Effect: 1-query 4-hop subgraph extraction; no manual multi-join/BFS.
- Q3 (Data+Knowledge)
  Manual steps: 1) Resolve the Pod's node; 2) List all Pods on that node; 3) Query {metric} for these Pods.
  U-SPL, step 1: .topo | graph-call cypher(MATCH (s:aiops@aiops.pod {__entity_id__: 'id'})-[e]-(d)-[f]-(g) RETURN g.__entity_id__ AS entity_ids)
  U-SPL, step 2: .entity_set with(domain='aiops', name='aiops.node', ids=['entity_ids']) | entity-call get_metric('aiops', 'aiops.metric.node', '{metric}', 'range', '', aggregate=false)
  Effect: Topology query first finds peer Pods on the same node, then batch metric retrieval is executed in one flow without manual joins.
... must query each Pod's metric data individually and perform aggr...
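For contrast with the single 4-hop Cypher query in the Q2 row, the manual side of that row (expand to 4 hops by BFS) can be sketched as follows. This is an illustrative stand-in, not code from the paper: the helper names and the sample topology are hypothetical, and a hop-bounded BFS returns the set of reachable entities rather than the matched paths a Cypher pattern would return.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from (a, b) pairs, treating edges as undirected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def k_hop_neighbors(adjacency, start, max_hops):
    """Return {node: hop distance} for every node within max_hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical dependency chain assembled by hand from ownership, traces, and logs.
topology = undirected([
    ("pod-a", "node-1"),
    ("node-1", "pod-b"),
    ("pod-b", "svc-x"),
    ("svc-x", "db-1"),
    ("db-1", "cache-9"),
])

within_four = k_hop_neighbors(topology, "pod-a", 4)
print(within_four)
```

In the manual workflow the adjacency map itself has to be built from several sources first, which is the multi-join step the table says the graph query removes.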