How Helpful is LLM Assistance in Network Operations? A Case Study at a Large Demonstration Network

Koshi Eguchi; Ryo Nakamura

arxiv: 2605.19627 · v1 · pith:QICNTCA4new · submitted 2026-05-19 · 💻 cs.NI

Pith reviewed 2026-05-20 02:29 UTC · model grok-4.3

keywords LLM assistance · network operations · chatbot evaluation · case study · network engineering · retrieval-augmented generation · CLI control · demonstration network

The pith

An LLM chatbot received positive evaluations in 68.1 percent of cases while helping engineers build and run a 21-rack demonstration network.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper describes a live experiment in which 105 network engineers worked on a large heterogeneous network using an LLM-based chatbot. The chatbot combined retrieval-augmented generation for technical knowledge, direct command-line control of devices, and ticket-system access. Participants rated responses as they built and operated the network, and analysis of the resulting chat logs showed 68.1 percent positive ratings. A sympathetic reader cares because the study supplies one of the first quantitative baselines for how probabilistic language models perform in a domain that normally demands deterministic precision. The work also notes that engineers obtained better results once they understood the chatbot's strengths and limits.

Core claim

The authors establish that an LLM chatbot equipped with retrieval-augmented generation, CLI device control, and ticket-system access can provide measurable assistance during real network construction and operation. In the 21-rack demonstration environment, 105 engineers produced chat histories whose evaluations were positive in 68.1 percent of cases. The study further shows that clearer user understanding of the chatbot's capabilities improves response quality and supplies concrete examples of successful and unsuccessful interactions.

What carries the argument

The LLM-based chatbot with three external functions—retrieval-augmented generation for domain knowledge, direct CLI control of live network devices, and ticket-system lookup—that together allow the model to act inside an operational network rather than merely answer questions.
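To make the three-function pattern concrete, here is a minimal sketch of how a chatbot backend might route a model's tool request to one of three handlers. Everything in it is hypothetical: the handler names (`search_knowledge`, `run_cli`, `lookup_ticket`), the in-memory data, and the rule that only `show` commands are accepted are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the three external functions. A real system
# would query a document index, a device session, and a ticket API.
KNOWLEDGE = {"bgp": "BGP exchanges routing information between autonomous systems."}
TICKETS = {"T-1": "Link flap on rack 7 uplink"}


def search_knowledge(query: str) -> str:
    """Return the first stored note whose key appears in the query."""
    for key, note in KNOWLEDGE.items():
        if key in query.lower():
            return note
    return "no matching document"


def run_cli(command: str) -> str:
    """Simulate a device command; this sketch accepts only 'show' commands."""
    cleaned = command.strip()
    if not cleaned.lower().startswith("show "):
        return "refused: only show commands are allowed in this sketch"
    return f"(simulated output of '{cleaned}')"


def lookup_ticket(ticket_id: str) -> str:
    """Return the ticket summary, or a not-found message."""
    return TICKETS.get(ticket_id, "ticket not found")


TOOLS: Dict[str, Callable[[str], str]] = {
    "search_knowledge": search_knowledge,
    "run_cli": run_cli,
    "lookup_ticket": lookup_ticket,
}


def dispatch(tool_name: str, argument: str) -> str:
    """Route one tool request from the model to the matching handler."""
    handler = TOOLS.get(tool_name)
    if handler is None:
        return f"unknown tool: {tool_name}"
    return handler(argument)
```

Keeping the handlers behind a single `dispatch` function is one possible design choice: it gives a single place to log or restrict what the model may do on live devices.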

If this is right

Engineers obtain clearer and more useful answers once they learn what the chatbot can and cannot do reliably.
The chatbot's combination of knowledge retrieval, device control, and ticket access enables concrete assistance across configuration, troubleshooting, and documentation tasks.
The 68.1 percent positive evaluation rate supplies a numerical starting point for measuring future improvements in LLM tools for network operations.
Real chat logs reveal recurring patterns of successful and unsuccessful use that can guide both system design and user training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar tool-augmented chatbots could be tested in production networks rather than demonstration settings to check whether the positive rate persists under stricter uptime constraints.
The results point toward hybrid human-AI workflows in which the LLM handles routine queries while engineers retain final control of device commands.
Quantitative baselines like this one may encourage operators to define clearer success metrics before deploying LLMs at scale.
The same three-function pattern—knowledge lookup, direct control, and workflow integration—might transfer to other infrastructure domains such as server management or cloud orchestration.

Load-bearing premise

Self-reported ratings collected on a best-effort basis during live network work accurately capture the chatbot's helpfulness without distortion from fatigue, selection bias, or varying rating standards.

What would settle it

A follow-up trial that records objective task-completion times and error rates with and without the chatbot, or that collects ratings under blinded or standardized conditions, would show whether the 68.1 percent positive figure holds or shrinks substantially.
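One simple robustness check on a rate like this is a confidence interval for the proportion. The sketch below computes a Wilson score interval. This summary does not state how many ratings were collected, so `n_ratings` and `positive` are placeholder values chosen only to reproduce a 68.1 percent rate. They are not figures from the paper.

```python
import math
from typing import Tuple


def wilson_interval(positive: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = positive / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Placeholder counts (assumption): 681 positive ratings out of 1000.
n_ratings = 1000
positive = 681
low, high = wilson_interval(positive, n_ratings)
print(f"observed {positive / n_ratings:.3f}, interval {low:.3f} to {high:.3f}")
```

An interval of this kind reflects sampling noise only. It does not correct for selection bias or inconsistent rating standards, which are the concerns a blinded or standardized follow-up trial would address.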

Figures

Figures reproduced from arXiv: 2605.19627 by Koshi Eguchi, Ryo Nakamura.

**Figure 2.** The chatbot system structure. A GPT-4.1 model de…

**Figure 3.** Number of threads per participant. Overall, the junior engineers used the chatbot most frequently, followed by the NOC members in terms of thread count, and then the vendor specialists. The junior engineers actively asked the chatbot to obtain technical explanations and configuration guidance. This point is described in Section IV-D. Before analyzing the chat histories, we processed them to decom…

**Figure 4.** Thread and topic segment. A thread consists of multiple…

**Figure 5.** CDF of the number of exchanges of prompts and…

**Figure 6.** The chatbot successfully suggests the fix for a multi…

**Figure 7.** A participant had the chatbot check the BGP status.

**Figure 9.** Proportion of segments classified into each category.

**Figure 10.** A case of Knowledge Support. A junior engineer asks…

**Figure 11.** The system prompt provided to the chatbot.

**Figure 12.** The topology information embedded in the system…

Abstract

This paper reports on a real-world case study in which over 100 network engineers assessed how a Large Language Model (LLM) can assist in building and operating a network. The versatility of LLMs has accelerated their adoption across a wide range of domains, and assisting network operations is one such promising application. LLMs are probabilistic models, unlike deterministic protocols and configurations; therefore, clarifying their capabilities -- how and to what extent LLMs can help in network operations -- is a crucial step toward adopting LLMs. To offer practical insights into this issue, we conducted an extensive experiment on a large demonstration network built for a public exhibition, consisting of 21 racks with heterogeneous network devices. In the experiment, a total of 105 network engineers used an LLM-based chatbot while building and operating the network. The chatbot was equipped with three external functions: retrieval-augmented generation for domain-specific knowledge, CLI control of network devices running on the network, and access to a ticket system. The participants gave evaluations for the chatbot's responses on a best-effort basis. Analysis of the chat histories shows that 68.1% of the evaluations were positive, indicating a quantitative baseline of the LLM's helpfulness in network operations. Our results also demonstrate that understanding the capabilities of the chatbot is important for eliciting better responses. Moreover, we provide detailed use case analyses while sharing actual user--chatbot interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A field study of 105 engineers using an LLM with RAG, CLI access, and ticket integration on a live 21-rack heterogeneous network, reporting 68.1% positive self-ratings from best-effort feedback.

The main takeaway is that this paper supplies numbers from a sizable group of network engineers who actually used an LLM assistant while building and running a real demonstration network. The chatbot combined retrieval-augmented generation for domain knowledge, direct CLI commands on the devices, and access to a ticket system. Across the interactions, 68.1% of the ratings the participants gave were positive, and the authors include some concrete chat examples plus use-case breakdowns.

Referee Report

1 major / 1 minor

Summary. The paper reports a real-world case study in which 105 network engineers used an LLM-based chatbot (with RAG for domain knowledge, CLI device control, and ticket-system access) while building and operating a 21-rack heterogeneous demonstration network. Participants supplied best-effort evaluations of chatbot responses; analysis of the resulting chat histories yields a headline figure of 68.1% positive evaluations, which the authors present as a quantitative baseline for LLM helpfulness in network operations. The manuscript also supplies detailed use-case analyses and excerpts of actual user–chatbot dialogues.

Significance. If the reported positive-evaluation rate can be shown to be robust, the work supplies one of the first large-scale empirical baselines for LLM utility in live network operations. The scale (105 practitioners, 21-rack heterogeneous testbed, integrated tooling) and the emphasis on real deployment tasks distinguish it from purely simulated or small-scale studies and could usefully inform adoption decisions in operational networking.

major comments (1)

[Abstract and Results] The central claim that the study establishes a 'quantitative baseline' rests on the 68.1% positive-evaluation rate. This figure comes from self-reported ratings collected on a best-effort basis, with no description of evaluation rubrics, mandatory participation, sampling frame, or checks for inter-rater consistency. Consequently, the percentage cannot be treated as a stable, bias-controlled baseline without additional methodological detail or supplementary analysis.

minor comments (1)

[Use-case analyses] The use-case analyses would benefit from explicit mapping of each example to the three external functions (RAG, CLI, ticket system) so readers can see which capability drove the observed outcome.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive criticism. We address the major comment on the robustness of our quantitative results below, and we are prepared to make revisions to strengthen the manuscript.

Referee: [Abstract and Results] The central claim that the study establishes a 'quantitative baseline' rests on the 68.1% positive-evaluation rate. This figure comes from self-reported ratings collected on a best-effort basis, with no description of evaluation rubrics, mandatory participation, sampling frame, or checks for inter-rater consistency. Consequently, the percentage cannot be treated as a stable, bias-controlled baseline without additional methodological detail or supplementary analysis.

Authors: We appreciate the referee's concern regarding the methodological details supporting our reported positive-evaluation rate. As described in the manuscript, this was a real-world case study conducted during the building and operation of a 21-rack heterogeneous demonstration network for a public exhibition. The 105 network engineers were working under time constraints typical of such deployments, and evaluations were solicited on a best-effort basis to avoid interfering with their primary tasks. Participation was voluntary, and there was no enforced sampling frame or mandatory rating requirement, as the goal was to capture natural usage of the chatbot in an operational setting. We did not employ a detailed evaluation rubric beyond asking users to indicate whether the response was helpful in their specific task context, nor did we implement inter-rater consistency checks, because each rating was provided by the end user for their own interaction. We acknowledge that these aspects limit the generalizability and statistical robustness of the 68.1% figure. In response, we will revise the manuscript to: (1) add a new subsection in the Results or Discussion explicitly describing the data collection methodology and its limitations, (2) moderate the language in the abstract and results to present the 68.1% as an observed rate from this case study rather than a definitive 'quantitative baseline', and (3) include supplementary analysis if feasible, such as a breakdown by task type. We believe these changes will address the referee's valid points while preserving the contribution of providing one of the first large-scale empirical observations from a live network operations environment.

Revision: yes
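A breakdown by task type, as mentioned in the rebuttal, could be computed along the following lines. This is a minimal sketch under stated assumptions: the input format (pairs of category label and a positive/negative flag) and the sample data are hypothetical. Of the labels used, only "Knowledge Support" appears in the paper's figure captions. The other two are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def positive_rate_by_category(ratings: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the share of positive ratings for each category label."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [positive, total]
    for category, is_positive in ratings:
        counts[category][1] += 1
        if is_positive:
            counts[category][0] += 1
    return {category: pos / total for category, (pos, total) in counts.items()}


# Hypothetical sample data, not taken from the study.
sample = [
    ("knowledge_support", True),
    ("knowledge_support", True),
    ("knowledge_support", False),
    ("cli_check", True),
    ("cli_check", False),
    ("ticket_lookup", True),
]
for category, rate in sorted(positive_rate_by_category(sample).items()):
    print(f"{category}: {rate:.2f}")
```

Reporting the per-category count alongside each rate would help readers judge how much weight a small category can bear.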

Circularity Check

0 steps flagged

Empirical case study reports observed ratings with no derivation or fitted predictions

The paper is a purely observational case study of 105 engineers using an LLM chatbot during live network operations on a 21-rack testbed. The central quantitative claim (68.1% positive evaluations) is obtained by direct counting of self-reported ratings collected on a best-effort basis from chat histories. No equations, parameters, predictions, uniqueness theorems, or ansatzes appear; the result is not derived from any prior result by the same authors and does not reduce to a self-referential definition or fitted input. The analysis therefore contains no load-bearing circular steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical user study with no mathematical free parameters or invented entities. It rests on standard domain assumptions about the reliability of self-reported feedback in operational settings.

axioms (1)

Domain assumption: Self-reported evaluations collected on a best-effort basis during live network operations provide a valid measure of LLM helpfulness.
The central quantitative result depends on this assumption about feedback quality and lack of bias.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "Analysis of the chat histories shows that 68.1% of the evaluations were positive, indicating a quantitative baseline of the LLM's helpfulness in network operations."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "The participants gave evaluations for the chatbot's responses on a best-effort basis."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

[1]

Scientific Reports15(1), 13755 (2025)

M. Raza, Z. Jahangir, M. B. Riaz, M. J. Saeed, and M. A. Sattar, “Industrial applications of large language models,”Scientific Reports, vol. 15, no. 1, p. 13755, Apr 2025. [Online]. Available: https://doi.org/10.1038/s41598-025-98483-1

work page doi:10.1038/s41598-025-98483-1 2025
[2]

Using an llm to help with code understanding,

D. Nam, A. Macvean, V . Hellendoorn, B. Vasilescu, and B. Myers, “Using an llm to help with code understanding,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery,

work page
[3]

Hellendoorn, Bogdan Vasilescu, and Brad A

[Online]. Available: https://doi.org/10.1145/3597503.3639187

work page doi:10.1145/3597503.3639187
[4]

Software testing with large language models: Survey, landscape, and vision,

J. Wang, Y . Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software testing with large language models: Survey, landscape, and vision,” IEEE Trans. Softw. Eng., vol. 50, no. 4, p. 911–936, Apr. 2024. [Online]. Available: https://doi.org/10.1109/TSE.2024.3368208

work page doi:10.1109/tse.2024.3368208 2024
[5]

A survey on large language models for software engineering,

Q. Zhang, C. Fang, Y . Xie, Y . Zhang, Y . Yang, W. Sun, S. Yu, and Z. Chen, “A survey on large language models for software engineering,”

work page
[6]

Available: https://arxiv.org/abs/2312.15223

[Online]. Available: https://arxiv.org/abs/2312.15223

work page arXiv
[7]

Mogul and John Wilkes

R. Mondal, A. Tang, R. Beckett, T. Millstein, and G. Varghese, “What do llms need to synthesize correct router configurations?” inProceedings of the 22nd ACM Workshop on Hot Topics in Networks, ser. HotNets ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 189–195. [Online]. Available: https://doi.org/10.1145/3626111.3628194

work page doi:10.1145/3626111.3628194 2023
[8]

Netconfeval: Can llms facilitate network configuration?

C. Wang, M. Scazzariello, A. Farshin, S. Ferlin, D. Kosti ´c, and M. Chiesa, “Netconfeval: Can llms facilitate network configuration?” Proc. ACM Netw., vol. 2, no. CoNEXT, Jun. 2024. [Online]. Available: https://doi.org/10.1145/3656296

work page doi:10.1145/3656296 2024
[9]

Confagent: Towards intelligent network configuration via llm agent,

S. Li, Z. Gan, J. Liu, C. Gao, F. Li, S. Wu, P. Hu, and F. Li, “Confagent: Towards intelligent network configuration via llm agent,” in2025 IEEE/ACM 33rd International Symposium on Quality of Service (IWQoS), 2025, pp. 1–10

work page 2025
[10]

Intent-Based Networking - Concepts and Definitions,

A. Clemm, L. Ciavaglia, L. Z. Granville, and J. Tantsura, “Intent-Based Networking - Concepts and Definitions,” RFC 9315, Oct. 2022. [Online]. Available: https://www.rfc-editor.org/info/rfc9315

work page 2022
[11]

Intent-based management of next-generation networks: an llm-centric approach,

A. Mekrache, A. Ksentini, and C. Verikoukis, “Intent-based management of next-generation networks: an llm-centric approach,”IEEE Network, vol. 38, no. 5, pp. 29–36, 2024

work page 2024
[12]

Towards intent-based config- uration for network function virtualization using in-context learning in large language models,

N. Van Tu, J.-H. Yoo, and J. W.-K. Hong, “Towards intent-based config- uration for network function virtualization using in-context learning in large language models,” inNOMS 2024-2024 IEEE Network Operations and Management Symposium, 2024, pp. 1–8

work page 2024
[13]

Integrating llms with netbox and netmiko for vendor-agnostic intent- based networking,

L. I. Nickel, L. Hohmann, N. Stolbov, L. Gerstacker, and S. Rieger, “Integrating llms with netbox and netmiko for vendor-agnostic intent- based networking,” inNOMS 2025-2025 IEEE Network Operations and Management Symposium, 2025, pp. 1–6

work page 2025
[14]

Kpi assurance and llms for intent- based management,

K. Dzeparoska and A. Leon-Garcia, “Kpi assurance and llms for intent- based management,” inNOMS 2025-2025 IEEE Network Operations and Management Symposium, 2025, pp. 1–9

work page 2025
[15]

Can LLMs Understand Computer Networks? Towards a Virtual System Administrator,

D. Donadel, F. Marchiori, L. Pajola, and M. Conti, “Can LLMs Understand Computer Networks? Towards a Virtual System Administrator,” in2024 IEEE 49th Conference on Local Computer Networks (LCN). Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2024, pp. 1–10. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/LCN60385.2024.10639641

work page doi:10.1109/lcn60385.2024.10639641 2024
[16]

Netpress: Dynamically generated llm benchmarks for network applications,

Y . Zhou, J. Ruan, E. S. Wang, S. Fouladi, F. Y . Yan, K. Hsieh, and Z. Liu, “Netpress: Dynamically generated llm benchmarks for network applications,” 2025. [Online]. Available: https://arxiv.org/abs/2506.03231

work page arXiv 2025
[17]

NetAssistant: Dialogue based network diagnosis in data center networks,

H. Wang, A. Abhashkumar, C. Lin, T. Zhang, X. Gu, N. Ma, C. Wu, S. Liu, W. Zhou, Y . Dong, W. Jiang, and Y . Wang, “NetAssistant: Dialogue based network diagnosis in data center networks,” in21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). Santa Clara, CA: USENIX Association, Apr. 2024, pp. 2011–2024. [Online]. Available: ht...

work page 2024
[18]

Towards llm-based failure localization in production-scale networks,

C. Wang, X. Zhang, R. Lu, X. Lin, X. Zeng, X. Zhang, Z. An, G. Wu, J. Gao, C. Tian, G. Chen, G. Liu, Y . Liao, T. Lin, D. Cai, and E. Zhai, “Towards llm-based failure localization in production-scale networks,” inProceedings of the ACM SIGCOMM 2025 Conference, ser. SIGCOMM ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 496–511. [On...

work page doi:10.1145/3718958.3750505 2025
[19]

ShowNet at Interop Tokyo: A Continuously Evolving Demonstration Network,

T. Tomine, R. Nakamura, and R. Motobayashi, “ShowNet at Interop Tokyo: A Continuously Evolving Demonstration Network,”The Internet Protocol Journal, vol. 28, no. 1, pp. 2–12, 2025. [Online]. Available: https://ipj.dreamhosters.com/wp-content/uploads/2025/04/281-ipj.pdf

work page 2025
[20]

Interop Tokyo 2025

“Interop Tokyo 2025.” [Online]. Available: https://www.interop.jp/2025/en/

work page 2025
[21]

Technology Highlights of ShowNet 2024,

R. Nakamura, H. Nakamura, K. Okada, and R. Kato, “Technology Highlights of ShowNet 2024,”The Internet Protocol Journal, vol. 28, no. 2, pp. 2–13, 2025. [Online]. Available: https://ipj.dreamhosters.com/wp-content/uploads/2025/08/282-ipj.pdf

work page 2024
[22]

Retrieval-augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K¨uttler, M. Lewis, W.-t. Yih, T. Rockt¨aschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY , USA: Curran Associate...

work page 2020
[23]

Function Calling – OpenAI Platform

“Function Calling – OpenAI Platform.” [Online]. Available: https://platform.openai.com/docs/guides/function-calling

work page
[24]

upa/llmexp-chatbot: Chatbot for our experiment: Assisting Network Operators with an LLM

“upa/llmexp-chatbot: Chatbot for our experiment: Assisting Network Operators with an LLM.” [Online]. Available: https://github.com/upa/llmexp-chatbot

work page
[25]

How to use Azure OpenAI Assistants file search - Azure OpenAI — Microsoft Learn

“How to use Azure OpenAI Assistants file search - Azure OpenAI — Microsoft Learn.” [Online]. Available: https://learn.microsoft.com/en- us/azure/ai-foundry/openai/how-to/file-search

work page
[26]

Chainlit/chainlit: Build Conversational AI in minutes,

“Chainlit/chainlit: Build Conversational AI in minutes,” 2025. [Online]. Available: https://github.com/Chainlit/chainlit

work page 2025
[27]

What is the Model Context Protocol (MCP)?

“What is the Model Context Protocol (MCP)?” [Online]. Available: https://modelcontextprotocol.io/docs/getting-started/intro

work page
[28]

upa/mcp-netmiko-server: An MCP server that enables LLMs interacting with your network devices,

“upa/mcp-netmiko-server: An MCP server that enables LLMs interacting with your network devices,” 2025. [Online]. Available: https://github.com/upa/mcp-netmiko-server

work page 2025
[29]

Diagram Syntax — Mermaid

“Diagram Syntax — Mermaid.” [Online]. Available: https://mermaid.js.org/intro/syntax-reference.html

work page
[30]

Text tiling: Segmenting text into multi-paragraph subtopic passages,

M. A. Hearst, “Text tiling: Segmenting text into multi-paragraph subtopic passages,”Computational Linguistics, vol. 23, no. 1, pp. 33–64, 1997. [Online]. Available: https://aclanthology.org/J97-1003/

work page 1997
[31]

Improving unsupervised dialogue topic segmentation with utterance-pair coherence scoring,

L. Xing and G. Carenini, “Improving unsupervised dialogue topic segmentation with utterance-pair coherence scoring,” inProceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, H. Li, G.-A. Levow, Z. Yu, C. Gupta, B. Sisman, S. Cai, D. Vandyke, N. Dethlefs, Y . Wu, and J. J. Li, Eds. Singapore and Online: Association ...

work page 2021
[32]

Recent trends in linear text segmentation: A survey,

I. Ghinassi, L. Wang, C. Newell, and M. Purver, “Recent trends in linear text segmentation: A survey,” inFindings of the Association for Computational Linguistics: EMNLP 2024, Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 3084–3095. [Online]. Available: https://aclanthology....

work page 2024
[33]

Uncovering the potential of ChatGPT for discourse analysis in dialogue: An empirical study,

Y . Fan, F. Jiang, P. Li, and H. Li, “Uncovering the potential of ChatGPT for discourse analysis in dialogue: An empirical study,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y . Kan, V . Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Ita...

work page 2024
[34]

SWE-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

work page 2024
[35]

The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models,

S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V . Suresh, I. Stoica, and J. E. Gonzalez, “The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=2GmDdhBdDk

work page 2025
[36]

Large Language Models are Zero-Shot Reasoners

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” 2023. [Online]. Available: https://arxiv.org/abs/2205.11916

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

netbox-community/netbox: The premier source of truth powering network automation

“netbox-community/netbox: The premier source of truth powering network automation.” [Online]. Available: https://github.com/netbox- community/netbox

work page
[39]

An adaptable ai assistant for network management,

A. Abane, A. Battou, and M. Merzouki, “An adaptable ai assistant for network management,” inNOMS 2024-2024 IEEE Network Operations and Management Symposium, 2024, pp. 1–3

work page 2024
[40]

Exploring llm-based agents for root cause analysis,

D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, “Exploring llm-based agents for root cause analysis,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY , USA: Association for Computing Machinery, 2024, p. 208–219. [Online]. Available: ...

work page doi:10.1145/3663529.3663841 2024
[41]

Rca copilot: Transforming network data into actionable insights via large language models,

A. Shan, J. Kaur, R. Singh, T. Banka, R. Yavatkar, and T. Sridhar, “Rca copilot: Transforming network data into actionable insights via large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2507.03224

work page arXiv 2025
[42]

Netllmbench: A benchmark framework for large language models in network configuration tasks,

K. Aykurt, A. Blenk, and W. Kellerer, “Netllmbench: A benchmark framework for large language models in network configuration tasks,” in 2024 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), 2024, pp. 1–6

work page 2024
[43]

Kathar ´a: A container-based framework for implementing network function vir- tualization and software defined networks,

G. Bonofiglio, V . Iovinella, G. Lospoto, and G. Di Battista, “Kathar ´a: A container-based framework for implementing network function vir- tualization and software defined networks,” inNOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium, 2018, pp. 1–9. APPENDIX Figure 11 shows the original system prompt of the chat- bot developed for ...

work page 2018

[1] [1]

Scientific Reports15(1), 13755 (2025)

M. Raza, Z. Jahangir, M. B. Riaz, M. J. Saeed, and M. A. Sattar, “Industrial applications of large language models,”Scientific Reports, vol. 15, no. 1, p. 13755, Apr 2025. [Online]. Available: https://doi.org/10.1038/s41598-025-98483-1

work page doi:10.1038/s41598-025-98483-1 2025

[2] [2]

Using an llm to help with code understanding,

D. Nam, A. Macvean, V . Hellendoorn, B. Vasilescu, and B. Myers, “Using an llm to help with code understanding,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery,

work page

[3] [3]

Hellendoorn, Bogdan Vasilescu, and Brad A

[Online]. Available: https://doi.org/10.1145/3597503.3639187

work page doi:10.1145/3597503.3639187

[4] [4]

Software testing with large language models: Survey, landscape, and vision,

J. Wang, Y . Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software testing with large language models: Survey, landscape, and vision,” IEEE Trans. Softw. Eng., vol. 50, no. 4, p. 911–936, Apr. 2024. [Online]. Available: https://doi.org/10.1109/TSE.2024.3368208

work page doi:10.1109/tse.2024.3368208 2024

[5] [5]

A survey on large language models for software engineering,

Q. Zhang, C. Fang, Y . Xie, Y . Zhang, Y . Yang, W. Sun, S. Yu, and Z. Chen, “A survey on large language models for software engineering,”

work page

[6] [6]

Available: https://arxiv.org/abs/2312.15223

[Online]. Available: https://arxiv.org/abs/2312.15223

work page arXiv

[7] R. Mondal, A. Tang, R. Beckett, T. Millstein, and G. Varghese, “What do llms need to synthesize correct router configurations?” in Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, ser. HotNets ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 189–195. [Online]. Available: https://doi.org/10.1145/3626111.3628194

[8] C. Wang, M. Scazzariello, A. Farshin, S. Ferlin, D. Kostić, and M. Chiesa, “Netconfeval: Can llms facilitate network configuration?” Proc. ACM Netw., vol. 2, no. CoNEXT, Jun. 2024. [Online]. Available: https://doi.org/10.1145/3656296

[9] S. Li, Z. Gan, J. Liu, C. Gao, F. Li, S. Wu, P. Hu, and F. Li, “Confagent: Towards intelligent network configuration via llm agent,” in 2025 IEEE/ACM 33rd International Symposium on Quality of Service (IWQoS), 2025, pp. 1–10.

[10] A. Clemm, L. Ciavaglia, L. Z. Granville, and J. Tantsura, “Intent-Based Networking - Concepts and Definitions,” RFC 9315, Oct. 2022. [Online]. Available: https://www.rfc-editor.org/info/rfc9315

[11] A. Mekrache, A. Ksentini, and C. Verikoukis, “Intent-based management of next-generation networks: an llm-centric approach,” IEEE Network, vol. 38, no. 5, pp. 29–36, 2024.

[12] N. Van Tu, J.-H. Yoo, and J. W.-K. Hong, “Towards intent-based configuration for network function virtualization using in-context learning in large language models,” in NOMS 2024-2024 IEEE Network Operations and Management Symposium, 2024, pp. 1–8.

[13] L. I. Nickel, L. Hohmann, N. Stolbov, L. Gerstacker, and S. Rieger, “Integrating llms with netbox and netmiko for vendor-agnostic intent-based networking,” in NOMS 2025-2025 IEEE Network Operations and Management Symposium, 2025, pp. 1–6.

[14] K. Dzeparoska and A. Leon-Garcia, “Kpi assurance and llms for intent-based management,” in NOMS 2025-2025 IEEE Network Operations and Management Symposium, 2025, pp. 1–9.

[15] D. Donadel, F. Marchiori, L. Pajola, and M. Conti, “Can LLMs Understand Computer Networks? Towards a Virtual System Administrator,” in 2024 IEEE 49th Conference on Local Computer Networks (LCN). Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2024, pp. 1–10. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/LCN60385.2024.10639641

[16] Y. Zhou, J. Ruan, E. S. Wang, S. Fouladi, F. Y. Yan, K. Hsieh, and Z. Liu, “Netpress: Dynamically generated llm benchmarks for network applications,” 2025. [Online]. Available: https://arxiv.org/abs/2506.03231

[17] H. Wang, A. Abhashkumar, C. Lin, T. Zhang, X. Gu, N. Ma, C. Wu, S. Liu, W. Zhou, Y. Dong, W. Jiang, and Y. Wang, “NetAssistant: Dialogue based network diagnosis in data center networks,” in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). Santa Clara, CA: USENIX Association, Apr. 2024, pp. 2011–2024. [Online]. Available: ht...

[18] C. Wang, X. Zhang, R. Lu, X. Lin, X. Zeng, X. Zhang, Z. An, G. Wu, J. Gao, C. Tian, G. Chen, G. Liu, Y. Liao, T. Lin, D. Cai, and E. Zhai, “Towards llm-based failure localization in production-scale networks,” in Proceedings of the ACM SIGCOMM 2025 Conference, ser. SIGCOMM ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 496–511. [On... (doi:10.1145/3718958.3750505)

[19] T. Tomine, R. Nakamura, and R. Motobayashi, “ShowNet at Interop Tokyo: A Continuously Evolving Demonstration Network,” The Internet Protocol Journal, vol. 28, no. 1, pp. 2–12, 2025. [Online]. Available: https://ipj.dreamhosters.com/wp-content/uploads/2025/04/281-ipj.pdf

[20] “Interop Tokyo 2025.” [Online]. Available: https://www.interop.jp/2025/en/

[21] R. Nakamura, H. Nakamura, K. Okada, and R. Kato, “Technology Highlights of ShowNet 2024,” The Internet Protocol Journal, vol. 28, no. 2, pp. 2–13, 2025. [Online]. Available: https://ipj.dreamhosters.com/wp-content/uploads/2025/08/282-ipj.pdf

[22] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY, USA: Curran Associate...

[23] “Function Calling – OpenAI Platform.” [Online]. Available: https://platform.openai.com/docs/guides/function-calling

[24] “upa/llmexp-chatbot: Chatbot for our experiment: Assisting Network Operators with an LLM.” [Online]. Available: https://github.com/upa/llmexp-chatbot

[25] “How to use Azure OpenAI Assistants file search - Azure OpenAI | Microsoft Learn.” [Online]. Available: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/file-search

[26] “Chainlit/chainlit: Build Conversational AI in minutes,” 2025. [Online]. Available: https://github.com/Chainlit/chainlit

[27] “What is the Model Context Protocol (MCP)?” [Online]. Available: https://modelcontextprotocol.io/docs/getting-started/intro

[28] “upa/mcp-netmiko-server: An MCP server that enables LLMs interacting with your network devices,” 2025. [Online]. Available: https://github.com/upa/mcp-netmiko-server

[29] “Diagram Syntax | Mermaid.” [Online]. Available: https://mermaid.js.org/intro/syntax-reference.html

[30] M. A. Hearst, “Text tiling: Segmenting text into multi-paragraph subtopic passages,” Computational Linguistics, vol. 23, no. 1, pp. 33–64, 1997. [Online]. Available: https://aclanthology.org/J97-1003/

[31] L. Xing and G. Carenini, “Improving unsupervised dialogue topic segmentation with utterance-pair coherence scoring,” in Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, H. Li, G.-A. Levow, Z. Yu, C. Gupta, B. Sisman, S. Cai, D. Vandyke, N. Dethlefs, Y. Wu, and J. J. Li, Eds. Singapore and Online: Association ...

[32] I. Ghinassi, L. Wang, C. Newell, and M. Purver, “Recent trends in linear text segmentation: A survey,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 3084–3095. [Online]. Available: https://aclanthology....

[33] Y. Fan, F. Jiang, P. Li, and H. Li, “Uncovering the potential of ChatGPT for discourse analysis in dialogue: An empirical study,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Ita...

[34] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world github issues?” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

[35] S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V. Suresh, I. Stoica, and J. E. Gonzalez, “The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=2GmDdhBdDk

[36] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” 2023. [Online]. Available: https://arxiv.org/abs/2205.11916

[37] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2201.11903

[38] “netbox-community/netbox: The premier source of truth powering network automation.” [Online]. Available: https://github.com/netbox-community/netbox

[39] A. Abane, A. Battou, and M. Merzouki, “An adaptable ai assistant for network management,” in NOMS 2024-2024 IEEE Network Operations and Management Symposium, 2024, pp. 1–3.

[40] D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, “Exploring llm-based agents for root cause analysis,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY, USA: Association for Computing Machinery, 2024, pp. 208–219. [Online]. Available: ... (doi:10.1145/3663529.3663841)

[41] A. Shan, J. Kaur, R. Singh, T. Banka, R. Yavatkar, and T. Sridhar, “Rca copilot: Transforming network data into actionable insights via large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2507.03224

[42] K. Aykurt, A. Blenk, and W. Kellerer, “Netllmbench: A benchmark framework for large language models in network configuration tasks,” in 2024 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), 2024, pp. 1–6.

[43] G. Bonofiglio, V. Iovinella, G. Lospoto, and G. Di Battista, “Kathará: A container-based framework for implementing network function virtualization and software defined networks,” in NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium, 2018, pp. 1–9.

APPENDIX

Figure 11 shows the original system prompt of the chatbot developed for ...