pith. machine review for the scientific record.

arxiv: 2605.12729 · v1 · submitted 2026-05-12 · 💻 cs.NI · cs.AI · cs.CR

Recognition: no theorem link

Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.NI · cs.AI · cs.CR
keywords large language models · NetOps · AIOps · agentic systems · assurance contracts · operational reliability · workflow evaluation · security risks

The pith

Operational reliability in LLM-based NetOps and AIOps comes from the machinery around the model rather than the model itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys literature on large language models used for agentic workflows in network operations and AIOps, covering tasks from incident investigation to limited self-healing. It organizes this work around a hierarchy of autonomy, tool scope, evidence traces, and assurance contracts that specify what agents may observe, propose, and execute along with required checks and rollback options. The central claim is that reliability is engineered through these surrounding controls, permissions, and policies rather than inherent model properties. Evaluation must therefore shift from static question answering to workflow-centered measures such as trace quality, bounded tool use, and sandboxed replays. The survey also flags acute security, privacy, and governance risks when agents operate near operational control surfaces.

Core claim

The paper claims that a consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. This pattern can be organized around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute, and they specify the checks that must pass before any action is allowed. Operational reliability does not come chiefly from the model itself; it depends on the machinery around the model, including permissions, policies, and rollback options. Evaluation should therefore move beyond static question answering to workflow-centred measures such as trace quality, bounded tool use, safe proposal generation, sandboxed replay, and canary trials with rollback-aware scoring.
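One way to picture an assurance contract is as a typed permission object that the executor consults before any action. The field names, check, and action shapes below are illustrative assumptions for this sketch, not an interface defined in the paper:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AssuranceContract:
    """What an agent may observe, propose, and execute, plus the
    checks that must pass before any action is allowed (hypothetical shape)."""
    observe: set[str]                       # readable telemetry surfaces
    propose: set[str]                       # action types the agent may draft
    execute: set[str]                       # action types it may run unattended
    checks: list[Callable[[dict], bool]] = field(default_factory=list)
    requires_rollback: bool = True

    def permits(self, action: dict) -> bool:
        """An action runs only if it is in the execute scope, a rollback
        plan is attached (when required), and every check passes."""
        if action["type"] not in self.execute:
            return False                    # proposal-only: route to a human
        if self.requires_rollback and not action.get("rollback"):
            return False
        return all(check(action) for check in self.checks)

contract = AssuranceContract(
    observe={"metrics", "logs"},
    propose={"config_change", "restart"},
    execute={"restart"},                    # config changes stay proposal-only
    checks=[lambda a: a.get("blast_radius", 1) <= 1],
)
print(contract.permits({"type": "restart", "rollback": "revert.sh", "blast_radius": 1}))  # True
print(contract.permits({"type": "config_change", "rollback": "revert.sh"}))               # False
```

The point of the sketch is that reliability lives in `permits`, the scopes, and the rollback requirement, not in whatever model drafted the action.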

What carries the argument

The hierarchy of autonomy, tool scope, evidence traces, and assurance contracts, which structures agent workflows from evidence gathering to action while enforcing operational checks and constraints.

If this is right

  • Agentic systems require workflow-centred evaluation that includes trace quality, bounded tool use, safe proposal generation, and replay in sandboxed environments with rollback-aware scoring.
  • Progress depends on treating autonomy as a constrained operational control problem whose outputs must be reliable, auditable, and securely deployable.
  • Security, privacy, and governance risks become acute when agents sit close to operational control surfaces.
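The first bullet treats evaluation as scoring whole workflows rather than answers. A minimal sketch of rollback-aware scoring over a replayed trace, with weights and field names that are illustrative assumptions rather than metrics from the paper:

```python
def rollback_aware_score(trace: list[dict], tool_budget: int) -> float:
    """Score a sandboxed replay of an agent trace: reward task success,
    penalise tool-budget overruns and any action executed without a
    tested rollback path. Weights are illustrative, not from the paper."""
    tool_calls = sum(1 for step in trace if step["kind"] == "tool")
    over_budget = max(0, tool_calls - tool_budget)
    unguarded = sum(1 for step in trace
                    if step["kind"] == "action" and not step.get("rollback_tested"))
    success = 1.0 if trace and trace[-1].get("resolved") else 0.0
    return success - 0.1 * over_budget - 0.5 * unguarded

trace = [
    {"kind": "tool", "name": "query_metrics"},
    {"kind": "tool", "name": "read_logs"},
    {"kind": "action", "name": "restart_pod", "rollback_tested": True, "resolved": True},
]
print(rollback_aware_score(trace, tool_budget=5))  # 1.0
```

A model that reaches the right fix while blowing its tool budget, or acting without a rollback path, scores lower than one that gets there inside the contract, which is exactly the shift from static QA the survey argues for.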

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hierarchy could be tested for fit in other agentic settings such as robotic control or financial trading systems.
  • Formal verification techniques could be added to strengthen assurance contracts.
  • Empirical studies could quantify how much each layer of the hierarchy contributes to measured reliability.

Load-bearing premise

That the reviewed tasks exhibit a consistent pattern that can be usefully organized around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts.

What would settle it

A detailed literature scan that finds no shared structure across the listed tasks or shows reliability deriving primarily from the LLM without the additional machinery of contracts and checks.

Figures

Figures reproduced from arXiv: 2605.12729 by Jon Crowcroft, Muhammad Bilal, Ruizhi Wang, Schahram Dustdar, Xiaolong Xu.

Figure 1: A ladder-of-autonomy taxonomy for agentic NetOps and AIOps. As systems move from read-only …
Figure 2: Unified evidence-to-action control loop across NetOps and AIOps. The system is partially observed …
Figure 3: Operational artefacts by trust level and staleness risk. The same artefact can move over time. …
Figure 4: A practical “LLM-in-ops” stack. Reliability comes mainly from typed tool interfaces, an explicit …
Figure 5: Operational agent state machine with mandatory gates and stop conditions. Read-only states can …
Figure 6: Operational budgets as first-class constraints on agent loops. Tool, token, and time budgets bound …
Figure 7: What a human approves in a guarded planner–executor system. The review object is a bundle …
Figure 8: Autonomy–risk coupling and the gated action envelope. As the agent moves from copilot to closed-loop control (left), the severity of key failure modes rises (right). A proposed action a is therefore routed into an execution policy: allow-listed low-risk actions may proceed only through a non-bypassable gate g(a, E, Π) with explicit checks and a rollback plan, while high-risk actions remain proposal-only …
Figure 9: Agentic NetOps loop specialised to high-consequence change. The LLM is useful as a workflow …
Figure 10: Common NetOps invariants that can form the verification wall. The point is not the notation; it …
Figure 11: Update as a protocol, not a tool call. Canary and staged expansion turn transient-safety …
Figure 12: Workflow view of agentic AIOps. The central loop is query-driven diagnosis: plan the next …
Figure 13: Evaluation ladder with matched reporting burden. Left: an evaluation ladder from offline corpora to canary-in-production studies, where realism and operational risk increase. Right: the minimum information that must be reported to make claims at each rung (tasks/data, tool surface, budgets, gates, trace logging, robustness). The connector highlights that moving down the ladder requires stricter evidence, …
Figure 14: Process quality is scorable. A trace is evaluated by discriminating value (did the query narrow …
Figure 15: Threat model for agentic NetOps and AIOps. Untrusted artefacts (tickets, runbooks, dashboards, …
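The gated action envelope g(a, E, Π) described in Figure 8 can be sketched as a routing function: allow-listed low-risk actions may execute once checks pass and a rollback plan exists, while everything else stays proposal-only. The action names, policy shape, and thresholds below are hypothetical:

```python
def gate(action: dict, evidence: dict, policy: dict) -> str:
    """Non-bypassable gate g(a, E, Π) in the spirit of Figure 8 (sketch).
    Routes a proposed action a, given evidence E and policy Π, to one of:
    execute, propose_only, or a blocked state."""
    if action["name"] not in policy["allow_list"]:
        return "propose_only"               # high-risk: a human approves the bundle
    if not action.get("rollback"):
        return "blocked_no_rollback"        # low-risk but unguarded: refuse
    if not all(check(action, evidence) for check in policy["checks"]):
        return "blocked_check_failed"
    return "execute"

policy = {
    "allow_list": {"clear_cache"},
    "checks": [lambda a, e: e.get("error_rate", 0.0) < 0.05],
}
print(gate({"name": "clear_cache", "rollback": "noop"}, {"error_rate": 0.01}, policy))  # execute
print(gate({"name": "push_bgp_config"}, {"error_rate": 0.01}, policy))                 # propose_only
```

Note that the gate never consults the model: the routing depends only on the policy, the evidence, and the presence of a rollback plan, which is the survey's claim about where reliability comes from in miniature.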
Original abstract

Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys the application of large language models to agentic NetOps and AIOps tasks including telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. It organizes the literature around a proposed hierarchy of autonomy, tool scope, evidence traces, and assurance contracts, arguing that operational reliability derives primarily from the surrounding machinery and constraints rather than the LLM itself. The work further advocates shifting evaluation from static QA to workflow-centered metrics (trace quality, bounded tool use, sandbox replay, canary trials) and examines security, privacy, and governance risks when agents approach operational control surfaces.

Significance. If the claimed consistent pattern across domains holds with explicit per-work mappings, the survey offers a useful organizing framework for designing constrained, auditable agentic systems in high-stakes operations. The emphasis on workflow evaluation and assurance contracts addresses a genuine gap between model capabilities and deployable reliability; the safety discussion is timely given the proximity to live control planes.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Literature Organization): The central claim that 'a consistent pattern appears across' the six domains and that 'operational reliability does not come chiefly from the model itself' is asserted via high-level summaries but lacks explicit per-paper mappings to the four elements (autonomy hierarchy, tool scope, evidence traces, assurance contracts) that isolate the machinery as the decisive factor. Without tables or structured breakdowns showing specific failures of model-only approaches versus successes attributable to contracts/traces, the pattern remains framed rather than evidenced.
  2. [§4] §4 (Evaluation): The call for workflow-centred evaluation (trace quality, bounded tool use, sandbox replay, rollback-aware scoring) is well-motivated but the section provides no concrete metrics, scoring rubrics, or example evaluation protocols drawn from the surveyed works; this leaves the recommendation at the level of desiderata rather than actionable guidance that could be adopted by the community.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction use 'agentic' and 'workflows' without an early formal definition or diagram; a small taxonomy figure would clarify the hierarchy before the literature sections.
  2. [§5] Several citations in the safety section (§5) appear to be recent preprints; adding a note on the recency and potential volatility of those sources would help readers assess the stability of the risk claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments. We address the major points below and will incorporate revisions to provide more explicit mappings and concrete evaluation guidance.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Literature Organization): The central claim that 'a consistent pattern appears across' the six domains and that 'operational reliability does not come chiefly from the model itself' is asserted via high-level summaries but lacks explicit per-paper mappings to the four elements (autonomy hierarchy, tool scope, evidence traces, assurance contracts) that isolate the machinery as the decisive factor. Without tables or structured breakdowns showing specific failures of model-only approaches versus successes attributable to contracts/traces, the pattern remains framed rather than evidenced.

    Authors: We appreciate this observation. Although the manuscript structures the discussion around the four elements with illustrative examples from the literature, we agree that a structured table would better evidence the consistent pattern. In the revised manuscript, we will introduce a summary table in Section 3 that explicitly maps each referenced work to the autonomy hierarchy, tool scope, evidence traces, and assurance contracts. This will highlight cases where model-only approaches fail and where the surrounding machinery provides the reliability, thereby strengthening the central claim. revision: yes

  2. Referee: [§4] §4 (Evaluation): The call for workflow-centred evaluation (trace quality, bounded tool use, sandbox replay, rollback-aware scoring) is well-motivated but the section provides no concrete metrics, scoring rubrics, or example evaluation protocols drawn from the surveyed works; this leaves the recommendation at the level of desiderata rather than actionable guidance that could be adopted by the community.

    Authors: We concur that the evaluation section would benefit from greater specificity. We will revise §4 to include concrete metrics and rubrics extracted from the surveyed papers, such as trace quality scores used in root-cause analysis studies, examples of bounded tool use from configuration works, and sandbox replay protocols from self-healing literature. Additionally, we will outline an example workflow evaluation protocol that the community could adopt, moving the recommendations from desiderata to actionable guidance. revision: yes

Circularity Check

0 steps flagged

No circularity: survey organizes literature without derivations or self-referential reductions

Full rationale

The paper is a literature survey that proposes an organizational hierarchy (autonomy, tool scope, evidence traces, assurance contracts) to frame existing work on NetOps/AIOps tasks. No equations, fitted parameters, predictions, or derivations appear anywhere in the text. The central observation that reliability depends on surrounding machinery is presented as a pattern distilled from cited external literature rather than derived from the authors' prior results or by construction from the survey's own inputs. Self-citations, if present, are not load-bearing for any claim; the paper contains no uniqueness theorems, ansatzes, or renamings that reduce to self-reference. The derivation chain is therefore self-contained as a descriptive re-organization with no internal circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Survey paper with no free parameters, no invented entities, and only standard domain assumptions about agent workflows.

axioms (1)
  • domain assumption Agent-based operations work as workflows with permissions, policies, checks, and rollback options
    Invoked to frame the hierarchy of autonomy and assurance contracts.

pith-pipeline@v0.9.0 · 5628 in / 1117 out tokens · 50292 ms · 2026-05-14T19:50:47.679112+00:00 · methodology



  22. [22]

    Safely and automatically updating in-network acl configurations with intent language,

    B. Tian, X. Zhang, E. Zhai, H. H. Liu, Q. Ye, C. Wang, X. Wu, Z. Ji, Y . Sang, M. Zhanget al., “Safely and automatically updating in-network acl configurations with intent language,” inProceedings of the ACM Special Interest Group on Data Communication, 2019, pp. 214–226

  23. [23]

    SWE-bench: Can language models resolve real-world github issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations, 2024

  24. [24]

    Understanding BGP misconfiguration,

    R. Mahajan, D. Wetherall, and T. Anderson, “Understanding BGP misconfiguration,” inProceedings of the ACM SIGCOMM 2002 Conference, 2002, pp. 3–16

  25. [25]

    Openflow: enabling innovation in campus networks,

    N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: enabling innovation in campus networks,”SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, p. 69–74, 2008

  26. [26]

    Consistent updates for software-defined networks: change you can believe in!

    M. Reitblatt, N. Foster, J. Rexford, and D. Walker, “Consistent updates for software-defined networks: change you can believe in!” inProceedings of the 10th ACM Workshop on Hot Topics in Networks, ser. HotNets-X, 2011, pp. 1–6

  27. [27]

    G-rca: A generic root cause analysis platform for service quality management in large ip networks,

    H. Yan, L. Breslau, Z. Ge, D. Massey, D. Pei, and J. Yates, “G-rca: A generic root cause analysis platform for service quality management in large ip networks,”IEEE/ACM Transactions on Networking, vol. 20, no. 6, pp. 1734–1747, 2012

  28. [28]

    Mining causality of network events in log data,

    S. Kobayashi, K. Otomo, K. Fukuda, and H. Esaki, “Mining causality of network events in log data,”IEEE Transactions on Network and Service Management, vol. 15, no. 1, pp. 53–67, 2018

  29. [29]

    Causal analysis of network logs with layered protocols and topology knowledge,

    S. Kobayashi, K. Otomo, and K. Fukuda, “Causal analysis of network logs with layered protocols and topology knowledge,” in2019 15th International Conference on Network and Service Management (CNSM), 2019, pp. 1–9

  30. [30]

    A general approach to network configuration verification,

    R. Beckett, A. Gupta, R. Mahajan, and D. Walker, “A general approach to network configuration verification,” inProceedings of the ACM SIGCOMM 2017 Conference, 2017, pp. 155–168

  31. [31]

    Checking beliefs in dynamic networks,

    N. P. Lopes, N. Bjørner, P. Godefroid, K. Jayaraman, and G. Varghese, “Checking beliefs in dynamic networks,” inProceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15), 2015, pp. 499–512. 45

  32. [32]

    Lessons from the evolution of the batfish configuration analysis tool,

    M. Brown, A. Fogel, D. Halperin, V . Heorhiadi, R. Mahajan, and T. Millstein, “Lessons from the evolution of the batfish configuration analysis tool,” inProceedings of the ACM SIGCOMM 2023 Conference, 2023, pp. 122–135

  33. [33]

    Pinpoint: Problem determination in large, dynamic internet services,

    M. Y . Chen, E. Kiciman, E. Fratkin, A. Fox, and E. A. Brewer, “Pinpoint: Problem determination in large, dynamic internet services,” inProceedings of the International Conference on Dependable Systems and Networks (DSN). IEEE, 2002, pp. 595–604

  34. [34]

    Orca: Differential bug localization in large-scale services,

    R. Bhagwan, R. Kumar, C. S. Maddila, and A. A. Philip, “Orca: Differential bug localization in large-scale services,” inProceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018, pp. 493–509

  35. [35]

    Towards intelligent incident management: why we need it and how we make it,

    Z. Chen, Y . Kang, L. Li, X. Zhang, H. Zhang, H. Xu, Y . Zhou, L. Yang, J. Sun, Z. Xuet al., “Towards intelligent incident management: why we need it and how we make it,” inProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1487–1497

  36. [36]

    How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems,

    J. Jiang, W. Lu, J. Chen, Q. Lin, P. Zhao, Y . Kang, H. Zhang, Y . Xiong, F. Gao, Z. Xuet al., “How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems,” inProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineeri...

  37. [37]

    Enjoy your observability: An industrial survey of microservice tracing and analysis,

    B. Li, X. Peng, Q. Xiang, H. Wang, T. Xie, J. Sun, and X. Liu, “Enjoy your observability: An industrial survey of microservice tracing and analysis,”Empirical Software Engineering, vol. 27, no. 1, p. 25, 2022

  38. [38]

    MRCA: Metric-level root cause analysis for microservices via multi-modal data,

    Y . Wang, Z. Zhu, Q. Fu, Y . Ma, and P. He, “MRCA: Metric-level root cause analysis for microservices via multi-modal data,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE ’24), 2024, pp. 1057–1068

  39. [39]

    Hemirca: Fine-grained root cause analysis for microservices with heterogeneous data sources,

    Z. Zhu, C. Lee, X. Tang, and P. He, “Hemirca: Fine-grained root cause analysis for microservices with heterogeneous data sources,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, 2024

  40. [40]

    X-trace: A pervasive network tracing framework,

    R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, “X-trace: A pervasive network tracing framework,” inProceedings of the 4th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’07), 2007, pp. 271–284

  41. [41]

    Canopy: An end-to-end performance tracing and analysis system,

    J. Kaldor, J. Mace, M. Bejda, E. Gao, W. Kuropatwa, J. O’Neill, K. W. Ong, B. Schaller, P. Shan, B. Viscomi, V . Venkataraman, K. Veeraraghavan, and Y . J. Song, “Canopy: An end-to-end performance tracing and analysis system,” inProceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP ’17), 2017, pp. 34–50

  42. [42]

    An empirical study of policy as code: Adoption, purpose, and maintenance,

    R. Opdebeeck, M. Alfadel, A. Rahman, Y . Kashiwa, J. F. Ferreira, R. G. Kula, and C. D. Roover, “An empirical study of policy as code: Adoption, purpose, and maintenance,” inProceedings of the 23rd International Conference on Mining Software Repositories (MSR 2026), 2026

  43. [43]

    Automated infrastructure as code program testing,

    D. Sokolowski, D. Spielmann, and G. Salvaneschi, “Automated infrastructure as code program testing,”IEEE Transactions on Software Engineering, vol. 50, no. 6, pp. 1585–1599, 2024

  44. [44]

    Change management in physical network lifecycle automation,

    M. Al-Fares, V . Beauregard, K. Grant, A. Griffith, J. Hasan, C. Huang, Q. Leng, J. Li, A. Lin, Z. Liu, A. Mansy, B. Martinusen, N. Mehta, J. C. Mogul, A. Narver, A. Nigham, M. Obenberger, S. Smith, K. Steinkraus, S. Sun, E. Thiele, and A. Vahdat, “Change management in physical network lifecycle automation,” in2023 USENIX Annual Technical Conference (USEN...

  45. [45]

    Learning from lessons learned: Preliminary findings from a study of learning from failure,

    J. Sillito and M. Pope, “Learning from lessons learned: Preliminary findings from a study of learning from failure,” inProceedings of the 2024 IEEE/ACM 17th International Conference on Cooperative and Human Aspects of Software Engineering, 2024, pp. 97–102

  46. [46]

    Drain: An online log parsing approach with fixed depth tree,

    P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in2017 IEEE International Conference on Web Services (ICWS), 2017, pp. 33–40

  47. [47]

    Deeplog: Anomaly detection and diagnosis from system logs through deep learning,

    M. Du, F. Li, G. Zheng, and V . Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 1285–1298

  48. [48]

    Loghub: A large collection of system log datasets for ai-driven log analytics,

    J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection of system log datasets for ai-driven log analytics,” in2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 355–366

  49. [49]

    OpenTelemetry specification,

    OpenTelemetry Authors, “OpenTelemetry specification,” Cloud Native Computing Foundation (CNCF), 2024, accessed: 2026-02-02

  50. [50]

    An empirical study on change-induced incidents of online service systems,

    Y. Wu, B. Chai, Y. Li, B. Liu, J. Li, Y. Yang, and W. Jiang, “An empirical study on change-induced incidents of online service systems,” in Proceedings of the IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2023, pp. 234–245

  51. [51]

    Identifying linked incidents in large-scale online service systems,

    Y. Chen, X. Yang, H. Dong, X. He, H. Zhang, Q. Lin, J. Chen, P. Zhao, Y. Kang, F. Gao, Z. Xu, and D. Zhang, “Identifying linked incidents in large-scale online service systems,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, pp. 304–314

  52. [52]

    Llexus: an ai agent system for incident management,

    P. Las-Casas, A. G. Kumbhare, R. Fonseca, and S. Agarwal, “Llexus: an ai agent system for incident management,” SIGOPS Oper. Syst. Rev., vol. 58, no. 1, 2024

  53. [53]

    Tool learning with foundation models,

    Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, X. Zhou, Y. Huang, C. Xiao, C. Han, Y. R. Fung, Y. Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y. Ye, B. Li, Z. Tang, J. Yi, Y. Zhu, Z. Dai, L. Yan, X. Cong, Y. Lu, W. Zhao, Y. Huang, J. Yan, X. Han, X. Sun, D. Li, J. Phang, C. Yang, T. Wu, H. Ji, G. Li, Z. L...

  54. [54]

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs,

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” in The Twelfth International Conference on Learning Representations, 2024

  55. [55]

    Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation,

    J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley Professional, 2010

  56. [56]

    Risk based planning of network changes in evolving data centers,

    O. Alipourfard, J. Gao, J. Koenig, C. Harshaw, A. Vahdat, and M. Yu, “Risk based planning of network changes in evolving data centers,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 414–429

  57. [57]

    Artificial intelligence risk management framework: Generative artificial intelligence profile,

    C. Autio, R. Schwartz, J. Dunietz, S. Jain, M. Stanley, E. Tabassi, P. Hall, and K. Roberts, “Artificial intelligence risk management framework: Generative artificial intelligence profile,” National Institute of Standards and Technology, Tech. Rep., 2024

  58. [58]

    Creating characteristically auditable agentic ai systems,

    C. C. Phiri, “Creating characteristically auditable agentic ai systems,” in Proceedings of Intelligent Robotics FAIR 2025 (IntRob ’25), 2025, pp. 1–14

  59. [59]

    What do llms need to synthesize correct router configurations?

    R. Mondal, A. Tang, R. Beckett, T. Millstein, and G. Varghese, “What do llms need to synthesize correct router configurations?” in Proceedings of the 22nd ACM Workshop on Hot Topics in Networks (HotNets ’23). Association for Computing Machinery, 2023, pp. 189–195

  60. [60]

    Meshagent: Enabling reliable network management with large language models,

    Y. Zhou, K. Hsieh, S. K. Mani, S. Kandula, and Z. Liu, “Meshagent: Enabling reliable network management with large language models,” Proc. ACM Meas. Anal. Comput. Syst., vol. 9, no. 3, Dec. 2025

  61. [61]

    Artificial intelligence risk management framework (ai rmf 1.0),

    E. Tabassi, “Artificial intelligence risk management framework (ai rmf 1.0),” National Institute of Standards and Technology, Tech. Rep., 2023

  62. [62]

    Dapper, a large-scale distributed systems tracing infrastructure,

    B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, “Dapper, a large-scale distributed systems tracing infrastructure,” Google, Inc., Tech. Rep., 2010, technical report (widely circulated)

  63. [63]

    Pivot tracing: Dynamic causal monitoring for distributed systems,

    J. Mace, R. Roelke, and R. Fonseca, “Pivot tracing: Dynamic causal monitoring for distributed systems,” in Proceedings of the 25th Symposium on Operating Systems Principles (SOSP ’15). Association for Computing Machinery, 2015, pp. 378–393

  64. [64]

    Real time network policy checking using header space analysis,

    P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and S. Whyte, “Real time network policy checking using header space analysis,” in Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, 2013, pp. 99–112

  65. [65]

    Accuracy, scalability, coverage: A practical configuration verifier on a global wan,

    F. Ye, D. Yu, E. Zhai, H. H. Liu, B. Tian, Q. Ye, C. Wang, X. Wu, T. Guo, C. Jin, D. She, Q. Ma, B. Cheng, H. Xu, M. Zhang, Z. Wang, and R. Fonseca, “Accuracy, scalability, coverage: A practical configuration verifier on a global wan,” in Proceedings of the ACM SIGCOMM 2020 Conference, 2020, pp. 599–614

  66. [66]

    Itbench: evaluating ai agents across diverse real-world it automation tasks,

    S. Jha, R. Arora, Y. Watanabe, T. Yanagawa, Y. Chen, J. Clark, B. Bhavya, M. Verma, H. Kumar, H. Kitahara, N. Zheutlin, S. Takano, D. Pathak, F. George, X. Wu, B. O. Turkkan, G. Vanloo, M. Nidd, T. Dai, O. Chatterjee, P. Gupta, S. Samanta, P. Aggarwal, R. Lee, J.-w. Ahn, D. Kar, A. Paradkar, Y. Deng, P. Moogi, P. Mohapatra, N. Abe, C. Narayanaswami, T....

  67. [67]

    Agentbench: Evaluating llms as agents,

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “Agentbench: Evaluating llms as agents,” in ICLR 2024, 2024

  68. [68]

    τ-bench: A benchmark for tool-agent-user interaction in real-world domains,

    S. Yao, N. Shinn, P. Razavi, and K. R. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,” in International Conference on Learning Representations (ICLR 2025), 2025. [Online]. Available: https://openreview.net/forum?id=roNSXZpUDN

  69. [69]

    Preserving integrity in the age of generative AI,

    I. McCormack. (2025, Jan.) Preserving integrity in the age of generative AI. National Cyber Security Centre (NCSC). [Online]. Available: https://www.ncsc.gov.uk/blog-post/preserving-integrity-in-age-generative-ai

  70. [70]

    Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,” in Proceedings of the 2023 Workshop on Artificial Intelligence and Security (AISec ’23). Association for Computing Machinery, 2023, pp. 79–90

  71. [71]

    StruQ: Defending against prompt injection with structured queries,

    S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “StruQ: Defending against prompt injection with structured queries,” in 34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 2383–2400

  72. [72]

    When AIOps become “AI oops”: Subverting LLM-driven IT operations via telemetry manipulation,

    D. Pasquini, E. M. Kornaropoulos, G. Ateniese, O. Akgul, A. Theocharis, and P. Efstathopoulos, “When AIOps become “AI oops”: Subverting LLM-driven IT operations via telemetry manipulation,” arXiv:2508.06394, 2025

  73. [73]

    Failures and fixes: A study of software system incident response,

    J. Sillito and E. Kutomi, “Failures and fixes: A study of software system incident response,” in 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020, pp. 185–195

  74. [74]

    Trust in collaborative automation in high stakes software engineering work: A case study at nasa,

    D. G. Widder, L. Dabbish, J. D. Herbsleb, A. Holloway, and S. Davidoff, “Trust in collaborative automation in high stakes software engineering work: A case study at nasa,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, 2021

  75. [75]

    RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,

    Z. Wang, Z. Liu, Y. Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen, “RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,” in Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024, pp. 4966–4974

  76. [76]

    Acto: Automatic end-to-end testing for operation correctness of cloud system management,

    J. T. Gu, X. Sun, W. Zhang, Y. Jiang, C. Wang, M. Vaziri, O. Legunsen, and T. Xu, “Acto: Automatic end-to-end testing for operation correctness of cloud system management,” in Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). ACM, 2023, pp. 96–112

  77. [77]

    Conveyor: One-Tool-Fits-All continuous software deployment at meta,

    B. Grubic, Y. Wang, T. Petrochko, R. Yaniv, B. Jones, D. Callies, M. Clarke-Lauer, D. Kelley, S. Demetriou, K. Yu, and C. Tang, “Conveyor: One-Tool-Fits-All continuous software deployment at meta,” in 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, 2023, pp. 325–342

  78. [78]

    Localized explanations for automatically synthesized network configurations,

    A. Nazari, Y. Zhang, M. Raghothaman, and H. Chen, “Localized explanations for automatically synthesized network configurations,” in Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (HotNets), 2024, pp. 52–59

  79. [79]

    Automatic configuration repair,

    X. Liu, P. Zhang, A. Abhashkumar, J. Chen, and W. Jiang, “Automatic configuration repair,” in Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (HotNets), 2024, pp. 213–220

  80. [80]

    Learning to generate structured output with schema reinforcement learning,

    Y. Lu, H. Li, X. Cong, Z. Zhang, Y. Wu, Y. Lin, Z. Liu, F. Liu, and M. Sun, “Learning to generate structured output with schema reinforcement learning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 4905–4918
