Benchmarking LLM-Driven Network Configuration Repair
Pith reviewed 2026-05-08 09:47 UTC · model grok-4.3
The pith
Large language models can repair network misconfigurations but frequently introduce new errors and perform worse on larger networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes Cornetto as the first benchmark for evaluating LLM-driven network configuration repair at scale. It features a pipeline that synthesizes representative misconfiguration scenarios across topologies with 20 to 754 nodes and an evaluation framework using formal verification to measure functional correctness against ground-truth specifications. Evaluation of nine state-of-the-art LLMs reveals that while they show promise, they often introduce regressions and their performance degrades at scale, indicating that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.
What carries the argument
The Cornetto benchmark, which pairs a misconfiguration synthesis pipeline with a formal-verification-based evaluation framework that checks proposed fixes for functional correctness.
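One way to make "functional correctness" and "regression" concrete is to compare which specification properties hold before and after a proposed fix. The sketch below illustrates that idea only; it is not Cornetto's scoring code, and the names `RepairOutcome` and `score_repair` and the property strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairOutcome:
    fixed: frozenset        # violated before the repair, hold after it
    unresolved: frozenset   # violated before the repair and still violated
    regressions: frozenset  # held before the repair, violated after it

    @property
    def is_correct(self) -> bool:
        # A repair counts as correct only if nothing is left broken
        # and nothing that used to work has been broken.
        return not self.unresolved and not self.regressions


def score_repair(spec, holds_before, holds_after) -> RepairOutcome:
    """Classify each specification property by its status before and after."""
    spec = frozenset(spec)
    before = frozenset(holds_before) & spec
    after = frozenset(holds_after) & spec
    violated_before = spec - before
    return RepairOutcome(
        fixed=violated_before & after,
        unresolved=violated_before - after,
        regressions=before - after,
    )


if __name__ == "__main__":
    # Hypothetical property names, chosen for illustration only.
    spec = {"reach(h1,h2)", "reach(h1,h3)", "isolated(h2,h3)"}
    outcome = score_repair(
        spec,
        holds_before={"reach(h1,h3)", "isolated(h2,h3)"},
        holds_after={"reach(h1,h2)", "reach(h1,h3)"},
    )
    print(sorted(outcome.fixed), sorted(outcome.regressions), outcome.is_correct)
```

In this toy run the fix restores one reachability property but breaks an isolation property, so it is scored as incorrect even though the original fault is gone.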
If this is right
- LLMs require integration into iterative workflows with formal verification to be reliable for network automation.
- Direct application of LLMs to large-scale network configurations risks introducing regressions.
- Performance of LLMs in fixing network errors decreases as the number of nodes and protocol complexity increase.
- The benchmark enables systematic testing of future LLM improvements in this domain.
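The iterative, verification-guided workflow that these implications point to can be sketched as a loop that accepts a candidate fix only when the verifier reports strictly fewer violations and no new ones. This is a minimal sketch, not the paper's method: `propose` stands in for an LLM call, `verify` for a formal verifier, and the toy configuration lines and helper names are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Config = str
Verifier = Callable[[Config], List[str]]          # returns violated properties
Proposer = Callable[[Config, List[str]], Config]  # stand-in for an LLM call


def repair_with_verifier(
    config: Config, propose: Proposer, verify: Verifier, max_rounds: int = 3
) -> Tuple[Config, List[str]]:
    """Keep the best configuration seen so far; reject any regression."""
    best, best_violations = config, set(verify(config))
    for _ in range(max_rounds):
        if not best_violations:
            break
        candidate = propose(best, sorted(best_violations))
        candidate_violations = set(verify(candidate))
        # Strict subset: at least one violation fixed and nothing new broken.
        if candidate_violations < best_violations:
            best, best_violations = candidate, candidate_violations
    return best, sorted(best_violations)


# Hypothetical toy verifier: the "specification" is a set of required lines.
REQUIRED = ("neighbor 10.0.0.2 remote-as 65002", "network 192.0.2.0/24")
PREFIX = "missing: "


def toy_verify(config: Config) -> List[str]:
    lines = config.splitlines()
    return [PREFIX + line for line in REQUIRED if line not in lines]


def toy_propose(config: Config, violations: List[str]) -> Config:
    # Hypothetical proposer: append the line named in the first violation.
    return config + "\n" + violations[0][len(PREFIX):]


if __name__ == "__main__":
    fixed, remaining = repair_with_verifier("router bgp 65001", toy_propose, toy_verify)
    print(fixed)
    print(remaining)
```

The strict-subset acceptance rule is one possible design choice; it trades progress for safety by refusing any candidate that fixes one property at the cost of another.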
Where Pith is reading between the lines
- Developers of network automation tools should prioritize hybrid systems that loop LLM suggestions through verifiers before deployment.
- Similar benchmarking approaches could be applied to other domains like cloud infrastructure or security policy repair.
- Real-world validation of the benchmark's scenarios would strengthen confidence in the observed LLM limitations.
Load-bearing premise
The generated misconfiguration scenarios reflect the types and distributions of errors found in actual production networks.
What would settle it
Applying the same LLMs directly to real operator-collected misconfigurations from production networks and observing no regressions or scale-related performance drops would challenge the claim that verification-guided workflows are required.
Original abstract
There is a rapidly growing interest in using Large Language Models (LLMs) to automate complex network operations, but their reliable adoption requires rigorous assessment of their effectiveness and safety. Existing benchmarks do not address whether LLMs can successfully resolve errors in large-scale, interdependent network configurations without introducing new disruptions. Developing such a benchmark is challenging: scenarios must be diverse and increasingly complex, yet their evaluation must be straightforward and meaningful. In this paper, we present Cornetto, the first benchmark to evaluate LLM-driven network configuration repair functionally and at scale. Cornetto features a generation pipeline that synthesizes representative and plausible misconfiguration scenarios, coupled with an evaluation framework that uses formal verification to assess functional correctness of proposed fixes against ground-truth specifications. Using this pipeline, we synthesize a dataset of 231 problems for fixing configurations across varying network topologies (20–754 nodes) and diverse protocols. We evaluate 9 state-of-the-art LLMs and find that while they show promise, they often introduce regressions and their performance degrades at scale. Our results indicate that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Cornetto, the first benchmark for evaluating LLM-driven repair of network configurations. It describes a generation pipeline that produces 231 synthetic misconfiguration scenarios across topologies of 20–754 nodes and multiple protocols, paired with a formal-verification oracle that checks proposed fixes against ground-truth specifications. Evaluation of nine state-of-the-art LLMs shows that they frequently introduce regressions and that performance degrades with scale, leading to the conclusion that reliable LLM-powered network automation requires iterative workflows guided by formal verification.
Significance. If the synthetic scenarios prove representative, the work supplies the first large-scale, functionally verified evidence that current LLMs are prone to regressions in interdependent network settings and that formal verification must be integrated into any practical deployment. The explicit use of an external formal oracle rather than heuristic metrics is a methodological strength that could serve as a template for other automation benchmarks.
major comments (3)
- [Abstract and generation pipeline] Abstract and generation-pipeline description: the claim that the pipeline produces 'representative and plausible' misconfiguration scenarios is not accompanied by any quantitative comparison to real-world corpora (operator bug reports, NANOG archives, or configuration-change logs). Because the headline findings on regression rates and scaling behavior rest entirely on the 231 synthetic problems, this absence is load-bearing for the central claim.
- [Evaluation framework] Evaluation-framework description: while the manuscript states that formal verification assesses 'functional correctness of proposed fixes against ground-truth specifications,' the concrete properties checked (e.g., reachability invariants, protocol-specific constraints, ACL consistency) are not enumerated. Without these details it is impossible to judge whether the measured regression rates reflect genuine functional failures or merely the oracle's coverage.
- [Results] Results section on scaling: the reported degradation 'at scale' is presented without an explicit definition of scale (node count, protocol-interaction complexity, or both) or statistical tests on the regression-introduction rates. This weakens the quantitative support for the recommendation of iterative verification workflows.
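As an illustration of the kind of test the last major comment asks for, the sketch below computes a Spearman rank correlation between topology size and regression rate in plain Python. It is a hedged example only: the function names are hypothetical, the manuscript does not say which test the authors would use, and the numbers in the usage section are placeholders rather than results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    # PLACEHOLDER values for illustration only; not taken from the paper.
    placeholder_node_counts = [20, 50, 200, 754]
    placeholder_regression_rates = [0.1, 0.2, 0.2, 0.4]
    print(spearman(placeholder_node_counts, placeholder_regression_rates))
```

A rank-based statistic is used here because it does not assume that regression rates grow linearly with node count; other tests would be equally defensible.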
minor comments (2)
- [Abstract and §1] The abstract and introduction use 'LLM' and 'ACL' without initial expansion; a single sentence defining each on first use would improve accessibility.
- [Figure 1] The pipeline diagram (presumably Figure 1) would benefit from explicit labels on each stage indicating input/output artifacts and the role of the formal verifier.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important areas for strengthening the manuscript's claims on representativeness, evaluation transparency, and scaling analysis. We address each major comment below, indicating planned revisions where appropriate.
- Referee: [Abstract and generation pipeline] Abstract and generation-pipeline description: the claim that the pipeline produces 'representative and plausible' misconfiguration scenarios is not accompanied by any quantitative comparison to real-world corpora (operator bug reports, NANOG archives, or configuration-change logs). Because the headline findings on regression rates and scaling behavior rest entirely on the 231 synthetic problems, this absence is load-bearing for the central claim.
  Authors: We acknowledge that the absence of a quantitative comparison to real-world corpora limits the strength of the 'representative and plausible' claim. Publicly available, structured corpora of network misconfigurations with verified ground-truth fixes are extremely limited due to the proprietary nature of operator data. Our pipeline draws from documented patterns in the network literature, including common errors discussed in NANOG archives and studies on configuration management. To address this, we will add a dedicated subsection to the generation pipeline description that provides a qualitative mapping of our 231 scenarios to categories of real-world issues (e.g., routing policy errors, ACL inconsistencies) cited in prior work. This addition will support the plausibility argument while preserving the benchmark's controlled and verifiable nature.
  Revision: partial
- Referee: [Evaluation framework] Evaluation-framework description: while the manuscript states that formal verification assesses 'functional correctness of proposed fixes against ground-truth specifications,' the concrete properties checked (e.g., reachability invariants, protocol-specific constraints, ACL consistency) are not enumerated. Without these details it is impossible to judge whether the measured regression rates reflect genuine functional failures or merely the oracle's coverage.
  Authors: We agree that explicit enumeration of the verified properties is necessary for readers to evaluate the oracle's coverage and the validity of the regression measurements. The full manuscript (Section 3.3) details that the formal verification oracle checks a comprehensive set of invariants derived from the ground-truth specifications. These include reachability between designated host pairs, absence of blackholes or forwarding loops, protocol convergence and preference constraints (for BGP and OSPF), ACL consistency and policy enforcement, and overall absence of policy violations. We will revise the evaluation framework section to include a clear table that enumerates each property, its formal definition, and the verification method used. This change will make the assessment of functional correctness fully transparent.
  Revision: yes
- Referee: [Results] Results section on scaling: the reported degradation 'at scale' is presented without an explicit definition of scale (node count, protocol-interaction complexity, or both) or statistical tests on the regression-introduction rates. This weakens the quantitative support for the recommendation of iterative verification workflows.
  Authors: Scale in the manuscript is defined primarily by topology size (node count ranging from 20 to 754 nodes), which directly correlates with increased protocol interactions and configuration interdependencies. Results are already broken down by size bins to show the degradation trend across all nine LLMs. While formal statistical tests were not included in the initial version, the consistent pattern of increased regressions with larger topologies provides clear support for the findings. We will update the results section with an explicit definition of scale, additional analysis (including correlation between node count and regression rate), and basic trend quantification to strengthen the quantitative basis for recommending iterative verification workflows.
  Revision: partial
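To make the invariants named in the authors' second response more tangible (reachability, absence of blackholes, absence of forwarding loops), here is a toy check over a single-destination next-hop table. It is a sketch under strong simplifying assumptions and not the oracle used by Cornetto: the `trace` and `find_violations` helpers and the router names are hypothetical, and a real verifier would reason about the control plane and all destinations.

```python
from typing import Dict, Iterable


def trace(next_hop: Dict[str, str], src: str, dst: str) -> str:
    """Follow next hops from src toward dst.

    next_hop maps each node to its next hop for traffic destined to dst;
    a node without an entry drops the traffic.
    Returns "delivered", "blackhole", or "loop".
    """
    seen = set()
    node = src
    while node != dst:
        if node in seen:
            return "loop"
        seen.add(node)
        if node not in next_hop:
            return "blackhole"
        node = next_hop[node]
    return "delivered"


def find_violations(
    next_hop: Dict[str, str], dst: str, sources: Iterable[str]
) -> Dict[str, str]:
    """Return every source whose traffic does not reach dst, with the reason."""
    violations = {}
    for src in sources:
        result = trace(next_hop, src, dst)
        if result != "delivered":
            violations[src] = result
    return violations


if __name__ == "__main__":
    # Hypothetical four-router topology; r4 is the destination.
    healthy = {"r1": "r2", "r2": "r3", "r3": "r4"}
    looping = {"r1": "r2", "r2": "r1", "r3": "r4"}
    dropping = {"r1": "r2", "r3": "r4"}  # r2 has no route toward r4
    for table in (healthy, looping, dropping):
        print(find_violations(table, "r4", ["r1", "r2", "r3"]))
```

Running the same check before and after a proposed fix gives the per-property status that a regression measurement needs.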
Circularity Check
No circularity: empirical benchmark with external oracle
The paper is a benchmarking study that defines a synthetic generation pipeline for misconfigurations, applies LLMs to repair them, and measures outcomes against an independent formal-verification oracle. No equations, derivations, or predictions are present that reduce to fitted inputs or self-referential definitions. Central claims rest on direct empirical measurements rather than any load-bearing self-citation chain or ansatz smuggling. The work is self-contained against its stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the synthesized misconfiguration scenarios are plausible and representative of real-world network errors.