pith. sign in

arxiv: 2606.26960 · v1 · pith:E4B6ANM2new · submitted 2026-06-25 · 💻 cs.NI

Toward Agentic SysAdmin: Rethinking System Administration with AI Agents

Pith reviewed 2026-06-26 02:22 UTC · model grok-4.3

classification 💻 cs.NI
keywords network administrationlarge language modelsagentic systemsevaluation frameworknetwork emulationsolver architecturesLLM benchmarking
0
0 comments X

The pith

Solver design lifts a 14B model's correctness on network tasks from 0.43 to 0.88 and lets it match trillion-parameter systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NetLLMeval, a framework that automatically evaluates LLM-based systems on network administration tasks by running live network emulations to produce ground truth without human intervention. It reports results from a full-factorial experiment of 24000 runs covering ten foundation models, four solver architectures, ten task types, and six network topologies. The central finding is that solver design exerts a large effect on accuracy. With the right configuration, smaller locally deployable models reach performance levels comparable to much larger frontier models.

Core claim

Through the NetLLMeval framework and its 24000-run study, solver architecture has a substantial impact on the accuracy of LLM systems performing network administration, raising a 14B open-weight model from 0.43 to 0.88 correctness and enabling locally deployable models to match trillion-parameter frontier systems under appropriate solver choices.

What carries the argument

NetLLMeval, a framework that derives ground truth via live network emulation to benchmark LLM solver architectures on network administration tasks.

If this is right

  • Solver choice, spanning monolithic prompting to fully agentic pipelines, determines whether smaller models reach high correctness on network tasks.
  • Locally deployable models can achieve parity with frontier systems when paired with suitable solver designs.
  • Automated emulation-based evaluation removes the need for static references or manual expert checks, enabling larger-scale benchmarking.
  • Performance differences appear across task types and topologies of increasing complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Engineering the solver pipeline may matter more than raw model scale for practical deployment in network management.
  • The framework could be extended to measure additional operational constraints such as latency, security posture, or recovery from failures.
  • Standardized emulation benchmarks might accelerate comparison of new agentic designs against existing approaches in systems administration.

Load-bearing premise

Live network emulation produces reliable ground truth for diverse real-world network states and orchestration strategies without fidelity gaps that would invalidate the correctness metric.

What would settle it

A side-by-side comparison of emulation-derived ground truth against results obtained from equivalent physical network hardware or from expert human validation on the same tasks and topologies.

Figures

Figures reproduced from arXiv: 2606.26960 by Alberto Castagnaro, Davide Saladino, Denis Donadel, Francesco Marchiori, Gianmaria Frigo, Luca Pajola, Mauro Conti.

Figure 1
Figure 1. Figure 1: NetLLMeval high-level architecture. on networking problems, establishing that general-purpose models benefit from domain adaptation. Related efforts have targeted 5G network analysis [13], security anomaly detec￾tion [14], and protocol bug identification [15], indicating broad applicability across the networking stack. Evaluation of LLMs in networking and operations. Sys￾tematic evaluation of LLMs on opera… view at source ↗
Figure 2
Figure 2. Figure 2: Architecture comparison between our agentic solvers. (a) The Guided [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Correctness ratio of every model–solver pair, with models ordered [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: Correctness matrix for the best deployable configuration (Minis [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Correctness rate of each model on each question task, using the best [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

The growing complexity of computer networks, driven by cloud-native architectures, heterogeneous devices, and distributed systems, places increasing pressure on network administrators who must simultaneously manage configuration, troubleshooting, and security under tight operational constraints. Large Language Models (LLMs) have emerged as a promising tool to assist and partially automate these tasks, yet their systematic evaluation in networking scenarios remains an open challenge. Existing benchmarks rely on static reference outputs or manual expert validation, neither of which scales to the diversity of real-world network states or to the variety of orchestration strategies -- from monolithic prompting to fully agentic pipelines --through which LLMs are increasingly deployed. In this paper, we present NetLLMeval, a framework for automatically evaluating LLM-based systems on network administration tasks by leveraging live network emulation to derive ground truth without human intervention. Through a full-factorial study of 24000 runs spanning 10 foundation models, 4 solver architectures, 10 task types, and 6 network topologies of increasing complexity, we show that solver design has a great impact on accuracy -- lifting a 14B open-weight model from 0.43 to 0.88 correctness -- and that such locally-deployable models can match trillion-parameter frontier systems under the right configuration. NetLLMeval is released open-source to support reproducible benchmarking of future models and solver designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces NetLLMeval, a framework for automatically evaluating LLM-based agentic systems on network administration tasks via live network emulation to derive ground truth without human intervention. It reports results from a full-factorial study of 24,000 runs across 10 foundation models, 4 solver architectures, 10 task types, and 6 network topologies, claiming that solver design has a large impact on accuracy (e.g., lifting a 14B open-weight model from 0.43 to 0.88 correctness) and that optimized locally-deployable models can match trillion-parameter frontier systems.

Significance. If the emulation faithfully reproduces real network behavior and the success predicates are valid, the work offers a scalable, reproducible benchmark for agentic networking systems that avoids the scalability limits of static or manual evaluation. The open-source release and emphasis on solver architecture over raw model scale are strengths that could influence future benchmarking practices in the field.

major comments (1)
  1. [Section 3 and evaluation protocol] Section 3 and the evaluation protocol: the correctness metric and all reported accuracy deltas (including the 0.43 to 0.88 lift and local-vs-frontier comparisons) are derived solely from automated success/failure labels produced by the emulator. No human validation, cross-check against production traces, or fidelity assessment is described, making the assumption that emulator outcomes equal operational correctness load-bearing for the central empirical claims.
minor comments (1)
  1. [Abstract and results presentation] The abstract states a 'full-factorial study of 24000 runs' but provides no visible error bars, exclusion criteria, or statistical controls in the high-level summary; these details should be added to the results section for interpretability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the evaluation protocol. We address the concern directly below and will revise the manuscript to improve transparency around the emulator's role.

read point-by-point responses
  1. Referee: [Section 3 and evaluation protocol] Section 3 and the evaluation protocol: the correctness metric and all reported accuracy deltas (including the 0.43 to 0.88 lift and local-vs-frontier comparisons) are derived solely from automated success/failure labels produced by the emulator. No human validation, cross-check against production traces, or fidelity assessment is described, making the assumption that emulator outcomes equal operational correctness load-bearing for the central empirical claims.

    Authors: We acknowledge that all reported metrics rely exclusively on automated emulator labels and that no human validation, production-trace cross-checks, or explicit fidelity assessment appears in the current manuscript. This is by design: the framework's primary goal is to support reproducible, large-scale evaluation (24,000 runs) that would be infeasible under manual labeling. We agree the emulator-fidelity assumption is load-bearing for the accuracy deltas and model comparisons. In the revision we will add to Section 3: (i) a precise description of the emulation stack and success predicates, (ii) any internal consistency checks performed during framework development, and (iii) an explicit limitations paragraph stating the assumption that emulator outcomes approximate operational correctness, together with a note that large-scale human validation remains future work. This change makes the assumption transparent without altering the experimental results. revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking study with no circular derivations

full rationale

The paper is a full-factorial empirical benchmarking study (24000 runs across models, solvers, tasks, and topologies) that measures LLM correctness against ground truth generated by live network emulation. No equations, fitted parameters, or self-referential definitions appear in the provided text; the accuracy deltas (e.g., 0.43 to 0.88) are computed directly from emulation outcomes rather than being forced by construction or prior self-citations. The work is therefore self-contained against external benchmarks with no load-bearing steps that reduce results to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5786 in / 1078 out tokens · 31604 ms · 2026-06-26T02:22:11.645909+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 7 linked inside Pith

  1. [1]

    The evolution of distributed computing systems: from fundamental to new frontiers,

    D. Lindsay, S. S. Gill, D. Smirnova, and P. Garraghan, “The evolution of distributed computing systems: from fundamental to new frontiers,” Computing, vol. 103, no. 8, pp. 1859–1878, 2021

  2. [2]

    Unraveling the complexity of network management

    T. Benson, A. Akella, and D. A. Maltz, “Unraveling the complexity of network management.” inNSDI, 2009, pp. 335–348

  3. [3]

    Machine learning for networking: Workflow, advances and opportunities,

    M. Wang, Y . Cui, X. Wang, S. Xiao, and J. Jiang, “Machine learning for networking: Workflow, advances and opportunities,”Ieee Network, vol. 32, no. 2, pp. 92–99, 2017

  4. [4]

    Large language models for networking: Workflow, advances and challenges,

    C. Liu, X. Xie, X. Zhang, and Y . Cui, “Large language models for networking: Workflow, advances and challenges,” 2024. [Online]. Available: https://arxiv.org/abs/2404.12901

  5. [5]

    A review on large language models: Archi- tectures, applications, taxonomies, open issues and challenges,

    M. A. K. Raiaanet al., “A review on large language models: Archi- tectures, applications, taxonomies, open issues and challenges,”IEEE access, vol. 12, pp. 26 839–26 874, 2024

  6. [6]

    Can llms understand computer networks? towards a virtual system administrator,

    D. Donadel, F. Marchiori, L. Pajola, and M. Conti, “Can llms understand computer networks? towards a virtual system administrator,” in2024 IEEE 49th Conference on Local Computer Networks (LCN). IEEE, 2024, pp. 1–10

  7. [7]

    Netconfeval: Can llms facilitate network configuration?

    C. Wang, M. Scazzariello, A. Farshin, S. Ferlin, D. Kosti ´c, and M. Chiesa, “Netconfeval: Can llms facilitate network configuration?” Proceedings of the ACM on Networking, vol. 2, no. CoNEXT2, pp. 1– 25, 2024

  8. [8]

    Aiopslab: A holistic framework to evaluate ai agents for enabling autonomous clouds,

    Y . Chenet al., “Aiopslab: A holistic framework to evaluate ai agents for enabling autonomous clouds,”Proceedings of Machine Learning and Systems, vol. 7, 2025

  9. [9]

    Netarena: Dynamic benchmarks for ai agents in network automation,

    Y . Zhouet al., “Netarena: Dynamic benchmarks for ai agents in network automation,” inThe Fourteenth International Conference on Learning Representations, 2026

  10. [10]

    A network arena for benchmarking ai agents on network troubleshoot- ing,

    Z. Wang, A. Cornacchia, A. Sacco, F. Galante, M. Canini, and D. Jiang, “A network arena for benchmarking ai agents on network troubleshoot- ing,”arXiv preprint arXiv:2512.16381, 2025

  11. [11]

    A comprehensive survey on llm-based network management and operations,

    J. Hong, N. V . Tu, and J. W.-K. Hong, “A comprehensive survey on llm-based network management and operations,”International Journal of Network Management, vol. 35, no. 6, p. e70029, 2025

  12. [12]

    Netllm: Adapting large language models for networking,

    D. Wuet al., “Netllm: Adapting large language models for networking,” inProceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 661– 678

  13. [13]

    Mobile-llama: Instruction fine-tuning open-source llm for network analysis in 5g networks,

    K. B. Kan, H. Mun, G. Cao, and Y . Lee, “Mobile-llama: Instruction fine-tuning open-source llm for network analysis in 5g networks,”IEEE Network, vol. 38, no. 5, pp. 76–83, 2024

  14. [14]

    A novel framework for detecting anomalies in network security using llm and deep learning,

    M. Entezami, S. Rahimi Harsini, D. Houshangi, and Z. Entezami, “A novel framework for detecting anomalies in network security using llm and deep learning,”Journal of Electrical Systems, vol. 21, no. 1s, pp. 294–302, 2025

  15. [15]

    An llm agent for functional bug detection in network protocols,

    M. Zheng, C. Wang, X. Liu, J. Guo, S. Feng, and X. Zhang, “An llm agent for functional bug detection in network protocols,”arXiv e-prints, pp. arXiv–2506, 2025

  16. [16]

    React: Synergizing reasoning and acting in language models,

    S. Yaoet al., “React: Synergizing reasoning and acting in language models,” 2023. [Online]. Available: https://arxiv.org/abs/2210.03629

  17. [17]

    Lfm2 technical report,

    L. AI, “Lfm2 technical report,”arXiv preprint arXiv:2511.23404, 2025

  18. [18]

    Granite 4.0 language models,

    IBM Research, “Granite 4.0 language models,” https://github.com/ ibm-granite/granite-4.0-language-models, 2025, accessed: 2025-10-01

  19. [19]

    Nvidia nemotron 3: Efficient and open intelligence,

    A. Blakemanet al., “Nvidia nemotron 3: Efficient and open intelligence,” arXiv preprint arXiv:2512.20856, 2025

  20. [20]

    The llama 3 herd of models,

    A. Grattafioriet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

  21. [21]

    Gemma 4: Frontier multimodal intelligence on device,

    Google DeepMind, “Gemma 4: Frontier multimodal intelligence on device,” Google AI, Technical Report, 2026. [Online]. Available: https://deepmind.google/models/gemma/gemma-4/

  22. [22]

    Rnj-1-5-Instruct,

    E. M. Callahanet al., “Rnj-1-5-Instruct,” 2026, long-context Instruction- tuned model release. [Online]. Available: https://huggingface.co/ EssentialAI/rnj-1-5-instruct

  23. [23]

    Qwen3.5: Towards native multimodal agents,

    Qwen Team, “Qwen3.5: Towards native multimodal agents,” February

  24. [24]

    Available: https://qwen.ai/blog?id=qwen3.5

    [Online]. Available: https://qwen.ai/blog?id=qwen3.5

  25. [25]

    Ministral 3,

    A. H. Liuet al., “Ministral 3,”arXiv preprint arXiv:2601.08584, 2026

  26. [26]

    Kimi k2.5: Visual agentic intelligence,

    K. Teamet al., “Kimi k2.5: Visual agentic intelligence,” 2026. [Online]. Available: https://arxiv.org/abs/2602.02276

  27. [27]

    Glm-5: from vibe coding to agentic engineering,

    GLM-5-Teamet al., “Glm-5: from vibe coding to agentic engineering,”

  28. [28]

    Available: https://arxiv.org/abs/2602.15763

    [Online]. Available: https://arxiv.org/abs/2602.15763

  29. [29]

    Kathar ´a: A container-based framework for implementing network function virtu- alization and software defined networks,

    G. Bonofiglio, V . Iovinella, G. Lospoto, and G. Di Battista, “Kathar ´a: A container-based framework for implementing network function virtu- alization and software defined networks,” inNOMS - 2018 IEEE/IFIP Network Operations and Management Symposium, 2018, pp. 1–9

  30. [30]

    Understanding the planning of llm agents: A survey,

    X. Huanget al., “Understanding the planning of llm agents: A survey,” arXiv preprint arXiv:2402.02716, 2024. Gianmaria Frigois currently pursuing the M.Sc. degree in Computer Engi- neering at the University of Padua, Italy, with a specialization in artificial in- telligence and robotics. He received the B.Sc. degree in Computer Engineering from the Univer...