How Helpful is LLM Assistance in Network Operations? A Case Study at a Large Demonstration Network
Pith reviewed 2026-05-20 02:29 UTC · model grok-4.3
The pith
An LLM chatbot received positive evaluations in 68.1 percent of cases while helping engineers build and run a 21-rack demonstration network.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that an LLM chatbot equipped with retrieval-augmented generation, CLI device control, and ticket-system access can provide measurable assistance during real network construction and operation. In the 21-rack demonstration environment, 105 engineers produced chat histories whose evaluations were positive in 68.1 percent of cases. The study further shows that clearer user understanding of the chatbot's capabilities improves response quality and supplies concrete examples of successful and unsuccessful interactions.
What carries the argument
An LLM-based chatbot with three external functions: retrieval-augmented generation for domain knowledge, direct CLI control of live network devices, and ticket-system lookup. Together, these let the model act inside an operational network rather than merely answer questions.
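The three-function design can be pictured as a small dispatcher that routes a model-issued tool call to one of three handlers. The sketch below is illustrative only: the `ToolCall` type, the handler names, and the stubbed return values are all assumptions made for this example and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation requested by the model (hypothetical shape)."""
    name: str
    argument: str


def search_knowledge(query: str) -> str:
    # Stub for retrieval-augmented generation: a real handler would
    # search domain documents and return the matching passages.
    return f"[rag] documents matching: {query}"


def run_cli(command: str) -> str:
    # Stub for CLI control: a real handler would send the command
    # to a live network device and return its output.
    return f"[cli] output of: {command}"


def lookup_ticket(ticket_id: str) -> str:
    # Stub for ticket-system access.
    return f"[ticket] details for: {ticket_id}"


# Registry of the three external functions, keyed by tool name.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search_knowledge": search_knowledge,
    "run_cli": run_cli,
    "lookup_ticket": lookup_ticket,
}


def dispatch(call: ToolCall) -> str:
    """Route a tool call to its handler, rejecting unknown tool names."""
    handler = TOOLS.get(call.name)
    if handler is None:
        raise ValueError(f"unknown tool: {call.name}")
    return handler(call.argument)
```

Calling `dispatch(ToolCall("run_cli", "show version"))` would reach the CLI stub. Keeping the registry explicit means a tool name the model invents is rejected rather than silently ignored.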
If this is right
- Engineers obtain clearer and more useful answers once they learn what the chatbot can and cannot do reliably.
- The chatbot's combination of knowledge retrieval, device control, and ticket access enables concrete assistance across configuration, troubleshooting, and documentation tasks.
- The 68.1 percent positive evaluation rate supplies a numerical starting point for measuring future improvements in LLM tools for network operations.
- Real chat logs reveal recurring patterns of successful and unsuccessful use that can guide both system design and user training.
Where Pith is reading between the lines
- Similar tool-augmented chatbots could be tested in production networks rather than demonstration settings to check whether the positive rate persists under stricter uptime constraints.
- The results point toward hybrid human-AI workflows in which the LLM handles routine queries while engineers retain final control of device commands.
- Quantitative baselines like this one may encourage operators to define clearer success metrics before deploying LLMs at scale.
- The same three-function pattern—knowledge lookup, direct control, and workflow integration—might transfer to other infrastructure domains such as server management or cloud orchestration.
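The hybrid workflow suggested above, in which engineers retain final control of device commands, could be sketched as an approval gate in front of the CLI function. This is an extrapolation for illustration, not something the paper describes: the allow-list of read-only prefixes, the function names, and the `approve` callback are all hypothetical, and simple prefix matching is not a safe filter on its own for a real device.

```python
from typing import Callable

# Hypothetical allow-list: command prefixes treated as read-only.
READ_ONLY_PREFIXES = ("show ", "ping ", "traceroute ")


def is_read_only(command: str) -> bool:
    """Return True if the command starts with an allow-listed prefix."""
    return command.strip().lower().startswith(READ_ONLY_PREFIXES)


def execute_with_approval(
    command: str,
    run: Callable[[str], str],
    approve: Callable[[str], bool],
) -> str:
    """Run read-only commands directly; ask a human before anything else."""
    if is_read_only(command) or approve(command):
        return run(command)
    return f"rejected by operator: {command}"
```

Here `run` stands for whatever sends the command to the device and `approve` for whatever asks the engineer. Both are passed in so the gate can be exercised without a live network.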
Load-bearing premise
Self-reported ratings collected on a best-effort basis during live network work accurately capture the chatbot's helpfulness without distortion from fatigue, selection bias, or varying rating standards.
What would settle it
A follow-up trial that records objective task-completion times and error rates with and without the chatbot, or that collects ratings under blinded or standardized conditions, would show whether the 68.1 percent positive figure holds or shrinks substantially.
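One inexpensive check on the headline figure is a confidence interval for the positive rate. The text above does not state how many ratings were collected, so the counts used in the usage note (681 positive out of 1000) are hypothetical and chosen only to reproduce 68.1 percent. The sketch computes a Wilson score interval, a standard interval for a binomial proportion. It addresses sampling noise only, not the fatigue or selection effects named in the premise.

```python
import math
from typing import Tuple


def wilson_interval(positives: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= positives <= total:
        raise ValueError("positives must lie between 0 and total")
    p = positives / total
    denom = 1 + z**2 / total
    # Center is the observed rate shrunk slightly toward 0.5.
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width
```

With the hypothetical counts, `wilson_interval(681, 1000)` returns a lower and upper bound that bracket 0.681. The same rate observed on fewer ratings gives a wider interval, which is why the number of evaluations matters for treating the figure as a baseline.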
Original abstract
This paper reports on a real-world case study in which over 100 network engineers assessed how a Large Language Model (LLM) can assist in building and operating a network. The versatility of LLMs has accelerated their adoption across a wide range of domains, and assisting network operations is one such promising application. LLMs are probabilistic models, unlike deterministic protocols and configurations; therefore, clarifying their capabilities -- how and to what extent LLMs can help in network operations -- is a crucial step toward adopting LLMs. To offer practical insights into this issue, we conducted an extensive experiment on a large demonstration network built for a public exhibition, consisting of 21 racks with heterogeneous network devices. In the experiment, a total of 105 network engineers used an LLM-based chatbot while building and operating the network. The chatbot was equipped with three external functions: retrieval-augmented generation for domain-specific knowledge, CLI control of network devices running on the network, and access to a ticket system. The participants gave evaluations for the chatbot's responses on a best-effort basis. Analysis of the chat histories shows that 68.1% of the evaluations were positive, indicating a quantitative baseline of the LLM's helpfulness in network operations. Our results also demonstrate that understanding the capabilities of the chatbot is important for eliciting better responses. Moreover, we provide detailed use case analyses while sharing actual user--chatbot interactions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a real-world case study in which 105 network engineers used an LLM-based chatbot (with RAG for domain knowledge, CLI device control, and ticket-system access) while building and operating a 21-rack heterogeneous demonstration network. Participants supplied best-effort evaluations of chatbot responses; analysis of the resulting chat histories yields a headline figure of 68.1% positive evaluations, which the authors present as a quantitative baseline for LLM helpfulness in network operations. The manuscript also supplies detailed use-case analyses and excerpts of actual user–chatbot dialogues.
Significance. If the reported positive-evaluation rate can be shown to be robust, the work supplies one of the first large-scale empirical baselines for LLM utility in live network operations. The scale (105 practitioners, 21-rack heterogeneous testbed, integrated tooling) and the emphasis on real deployment tasks distinguish it from purely simulated or small-scale studies and could usefully inform adoption decisions in operational networking.
major comments (1)
- [Abstract and Results] Abstract and Results section: The central claim that the study establishes a 'quantitative baseline' rests on the 68.1% positive-evaluation rate. This figure is obtained from self-reported ratings collected on a best-effort basis with no description of evaluation rubrics, mandatory participation, sampling frame, or checks for inter-rater consistency. Consequently the percentage cannot be treated as a stable, bias-controlled baseline without additional methodological detail or supplementary analysis.
minor comments (1)
- [Use-case analyses] The use-case analyses would benefit from explicit mapping of each example to the three external functions (RAG, CLI, ticket system) so readers can see which capability drove the observed outcome.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive criticism. We address the major comment on the robustness of our quantitative results below, and we are prepared to make revisions to strengthen the manuscript.
Referee: [Abstract and Results] Abstract and Results section: The central claim that the study establishes a 'quantitative baseline' rests on the 68.1% positive-evaluation rate. This figure is obtained from self-reported ratings collected on a best-effort basis with no description of evaluation rubrics, mandatory participation, sampling frame, or checks for inter-rater consistency. Consequently the percentage cannot be treated as a stable, bias-controlled baseline without additional methodological detail or supplementary analysis.
Authors: We appreciate the referee's concern regarding the methodological details supporting our reported positive-evaluation rate. As described in the manuscript, this was a real-world case study conducted during the building and operation of a 21-rack heterogeneous demonstration network for a public exhibition. The 105 network engineers were working under time constraints typical of such deployments, and evaluations were solicited on a best-effort basis to avoid interfering with their primary tasks. Participation was voluntary, and there was no enforced sampling frame or mandatory rating requirement, as the goal was to capture natural usage of the chatbot in an operational setting. We did not employ a detailed evaluation rubric beyond asking users to indicate whether the response was helpful in their specific task context, nor did we implement inter-rater consistency checks because each rating was provided by the end-user for their own interaction. We acknowledge that these aspects limit the generalizability and statistical robustness of the 68.1% figure.
In response, we will revise the manuscript to: (1) add a new subsection in the Results or Discussion explicitly describing the data collection methodology and its limitations, (2) moderate the language in the abstract and results to present the 68.1% as an observed rate from this case study rather than a definitive 'quantitative baseline', and (3) include supplementary analysis if feasible, such as breakdown by task type. We believe these changes will address the referee's valid points while preserving the contribution of providing one of the first large-scale empirical observations from a live network operations environment.
Revision: yes
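The breakdown by task type mentioned in the response could be computed along the following lines. This is a minimal sketch under stated assumptions: the `(task_type, is_positive)` record shape is hypothetical, the task labels are borrowed from the categories named earlier on this page (configuration, troubleshooting, documentation), and the sample data in the usage note is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def positive_rate_by_task(ratings: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each task type to its fraction of positive ratings."""
    # Each entry holds [positive_count, total_count] for one task type.
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for task_type, is_positive in ratings:
        counts[task_type][0] += int(is_positive)
        counts[task_type][1] += 1
    return {task: pos / total for task, (pos, total) in counts.items()}
```

For example, the made-up input `[("configuration", True), ("configuration", False), ("troubleshooting", True)]` yields a rate of 0.5 for configuration and 1.0 for troubleshooting. Per-task counts would need to be reported alongside the rates, since a rate based on a handful of ratings says little.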
Circularity Check
Empirical case study reports observed ratings with no derivation or fitted predictions
The paper is a purely observational case study of 105 engineers using an LLM chatbot during live network operations on a 21-rack testbed. The central quantitative claim (68.1% positive evaluations) is obtained by direct counting of self-reported ratings collected on a best-effort basis from chat histories. No equations, parameters, predictions, uniqueness theorems, or ansatzes appear; the result is not derived from any prior result by the same authors and does not reduce to a self-referential definition or fitted input. The analysis therefore contains no load-bearing circular steps of the enumerated kinds.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Self-reported evaluations collected on a best-effort basis during live network operations provide a valid measure of LLM helpfulness.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Analysis of the chat histories shows that 68.1% of the evaluations were positive, indicating a quantitative baseline of the LLM's helpfulness in network operations."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "The participants gave evaluations for the chatbot's responses on a best-effort basis."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.