arxiv: 2604.22513 · v1 · submitted 2026-04-24 · 💻 cs.NI

Benchmarking LLM-Driven Network Configuration Repair

Pith reviewed 2026-05-08 09:47 UTC · model grok-4.3

classification 💻 cs.NI
keywords large language models · network configuration repair · benchmark · formal verification · misconfiguration · network automation · LLM evaluation · configuration errors

The pith

Large language models can repair network misconfigurations but frequently introduce new errors and perform worse on larger networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a benchmark to test whether large language models can automatically fix errors in complex network configurations without creating new disruptions. It generates synthetic but plausible misconfiguration scenarios and relies on formal verification to determine whether proposed fixes meet the original specifications. Testing nine LLMs across 231 problems on topologies ranging from 20 to 754 nodes shows that the models sometimes succeed yet often cause regressions, with success rates declining as network size and interdependence grow. A reader would care because network operations are safety-critical, and the results indicate that standalone LLM use is insufficient for reliable automation.

Core claim

The paper establishes Cornetto as the first benchmark for evaluating LLM-driven network configuration repair at scale. It features a pipeline that synthesizes representative misconfiguration scenarios across topologies with 20 to 754 nodes and an evaluation framework using formal verification to measure functional correctness against ground-truth specifications. Evaluation of nine state-of-the-art LLMs reveals that while they show promise, they often introduce regressions and their performance degrades at scale, indicating that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.

What carries the argument

The Cornetto benchmark: a misconfiguration synthesis pipeline paired with a formal-verification-based evaluation framework that checks proposed fixes for functional correctness.

If this is right

  • LLMs require integration into iterative workflows with formal verification to be reliable for network automation.
  • Direct application of LLMs to large-scale network configurations risks introducing regressions.
  • Performance of LLMs in fixing network errors decreases as the number of nodes and protocol complexity increase.
  • The benchmark enables systematic testing of future LLM improvements in this domain.
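
The first point above, an iterative workflow guided by formal verification, can be sketched as a small loop. This is only an illustration under stated assumptions, not the paper's implementation: `repair_loop`, `VerifierReport`, `toy_verify`, and `toy_propose_fix` are hypothetical names, and the toy functions stand in for a real LLM call and a real verifier.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class VerifierReport:
    """Specifications the verifier found violated (empty means all hold)."""
    violated: FrozenSet[str]


def repair_loop(
    config: str,
    propose_fix: Callable[[str, VerifierReport], str],
    verify: Callable[[str], VerifierReport],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes verification, or None."""
    candidate = config
    report = verify(candidate)
    for _ in range(max_rounds):
        if not report.violated:
            return candidate
        # Feed the verifier's findings back to the model for the next attempt.
        candidate = propose_fix(candidate, report)
        report = verify(candidate)
    return candidate if not report.violated else None


# Toy stand-ins (hypothetical): a real setup would call an LLM and a verifier.
def toy_verify(config: str) -> VerifierReport:
    missing = {"reach(h1,h2)"} if "deny any" in config else set()
    return VerifierReport(frozenset(missing))


def toy_propose_fix(config: str, report: VerifierReport) -> str:
    return config.replace("deny any", "permit any")


fixed = repair_loop("acl 10 deny any", toy_propose_fix, toy_verify)
```

The design choice worth noting is that a candidate is only returned after the verifier accepts it, so a fix that trades one violation for another never leaves the loop as a success.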

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of network automation tools should prioritize hybrid systems that loop LLM suggestions through verifiers before deployment.
  • Similar benchmarking approaches could be applied to other domains like cloud infrastructure or security policy repair.
  • Real-world validation of the benchmark's scenarios would strengthen confidence in the observed LLM limitations.

Load-bearing premise

The generated misconfiguration scenarios reflect the types and distributions of errors found in actual production networks.

What would settle it

Applying the same LLMs directly to real operator-collected misconfigurations from production networks and observing no regressions or scale-related performance drops would challenge the claim that verification-guided workflows are required.

Figures

Figures reproduced from arXiv: 2604.22513 by Benjamin Hoffman, Ioannis Protogeros, Laurent Vanbever, Rufat Asadli.

Figure 1: Cornetto architecture. (I) The Dataset Generation pipeline coordinates the scenarios to ensure a diverse and complex test suite, generates sensible configurations and misconfigurations, and provides a standardized problem definition. (II) The Evaluation Framework enables automated and meaningful evaluation of the created scenarios by validating the reconfigured network’s behaviour against ground-truth spec…
Figure 1: The Dataset Generation (§3) pipeline accepts a topology collection and a fault library to generate a minimal yet diverse suite of misconfiguration scenarios, along with their problem formulations. With the generated scenarios, the Evaluation Framework (§4) assesses repair capabilities by verifying configurations against a ground-truth specification. Scenario coordination and creation (§3.2). To drive faul…
Figure 2: While configuration perturbations are min…
Figure 3: For each unique destination prefix, the pipeline uses the forwarding behaviour table (calculated by Batfish) to construct a forwarding graph, from which it derives the specifications of the network.
  • The set of regressions as originally healthy specifications that are violated by the fix: $\Phi_{\text{regressed}} = \{\phi \in (\Phi \setminus V) \mid C_{\text{fix}} \not\models \phi\}$
  • The set of violations that remained unresolved: $\Phi_{\text{unfixed}} = V \setminus \Phi_{\text{fixed}}$
And w…
Figure 4: Frontier LLMs benefit from access to global…
Figure 5: Cornetto is not saturated with either impossible or trivial tasks…
Figure 6: Diagnostic accuracy is necessary but not suf…
Figure 7: Repair performance consistently degrades…
Figure 8: Models struggle to handle concurrent fail…
Figure 9: While some models remain robust, many per…
Figure 10: Overview of the model leaderboard using other core performance metrics: diagnosis (left) and localization…
Figure 11: Diagnosis performance (left) consistently degrades with increasing input prompt tokens. The same…
Figure 12: Cost-Pareto frontier with respect to average fix score (left) and regression rate (right).
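
The set definitions quoted with Figure 3 can be made concrete with a few lines of set arithmetic. The sketch below assumes Φ is the full set of specifications, V ⊆ Φ the subset violated by the misconfiguration, and `holds_after_fix[φ]` records whether the fixed configuration satisfies φ. The definition of the fixed set (violations that hold again after the fix) is an assumption, since the excerpt is truncated before stating it. The function name `classify_fix` is hypothetical.

```python
from typing import Dict, Set, Tuple


def classify_fix(
    specs: Set[str],
    violations: Set[str],
    holds_after_fix: Dict[str, bool],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split specifications into (fixed, regressed, unfixed) sets."""
    # Assumed definition: violated specs that the fix satisfies again.
    fixed = {phi for phi in violations if holds_after_fix[phi]}
    # Originally healthy specs (Phi \ V) that the fix now violates.
    regressed = {phi for phi in specs - violations if not holds_after_fix[phi]}
    # Violations that remain unresolved: V \ fixed.
    unfixed = violations - fixed
    return fixed, regressed, unfixed


specs = {"a", "b", "c", "d"}
violations = {"a", "b"}
holds = {"a": True, "b": False, "c": True, "d": False}
fixed, regressed, unfixed = classify_fix(specs, violations, holds)
```

In this toy input, specification "d" was healthy before the fix and fails afterwards, which is exactly the regression case the benchmark is designed to surface.
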
Original abstract

There is a rapidly growing interest in using Large Language Models (LLMs) to automate complex network operations, but their reliable adoption requires rigorous assessment of their effectiveness and safety. Existing benchmarks do not address whether LLMs can successfully resolve errors in large-scale, interdependent network configurations without introducing new disruptions. Developing such a benchmark is challenging: scenarios must be diverse and increasingly complex, yet their evaluation must be straightforward and meaningful. In this paper, we present Cornetto, the first benchmark to evaluate LLM-driven network configuration repair functionally and at scale. Cornetto features a generation pipeline that synthesizes representative and plausible misconfiguration scenarios, coupled with an evaluation framework that uses formal verification to assess functional correctness of proposed fixes against ground-truth specifications. Using this pipeline, we synthesize a dataset of 231 problems for fixing configurations across varying network topologies (20–754 nodes) and diverse protocols. We evaluate 9 state-of-the-art LLMs and find that while they show promise, they often introduce regressions and their performance degrades at scale. Our results indicate that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Cornetto, the first benchmark for evaluating LLM-driven repair of network configurations. It describes a generation pipeline that produces 231 synthetic misconfiguration scenarios across topologies of 20–754 nodes and multiple protocols, paired with a formal-verification oracle that checks proposed fixes against ground-truth specifications. Evaluation of nine state-of-the-art LLMs shows that they frequently introduce regressions and that performance degrades with scale, leading to the conclusion that reliable LLM-powered network automation requires iterative workflows guided by formal verification.

Significance. If the synthetic scenarios prove representative, the work supplies the first large-scale, functionally verified evidence that current LLMs are prone to regressions in interdependent network settings and that formal verification must be integrated into any practical deployment. The explicit use of an external formal oracle rather than heuristic metrics is a methodological strength that could serve as a template for other automation benchmarks.

major comments (3)
  1. [Abstract and generation pipeline] Abstract and generation-pipeline description: the claim that the pipeline produces 'representative and plausible' misconfiguration scenarios is not accompanied by any quantitative comparison to real-world corpora (operator bug reports, NANOG archives, or configuration-change logs). Because the headline findings on regression rates and scaling behavior rest entirely on the 231 synthetic problems, this absence is load-bearing for the central claim.
  2. [Evaluation framework] Evaluation-framework description: while the manuscript states that formal verification assesses 'functional correctness of proposed fixes against ground-truth specifications,' the concrete properties checked (e.g., reachability invariants, protocol-specific constraints, ACL consistency) are not enumerated. Without these details it is impossible to judge whether the measured regression rates reflect genuine functional failures or merely the oracle's coverage.
  3. [Results] Results section on scaling: the reported degradation 'at scale' is presented without an explicit definition of scale (node count, protocol-interaction complexity, or both) or statistical tests on the regression-introduction rates. This weakens the quantitative support for the recommendation of iterative verification workflows.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction use 'LLM' and 'ACL' without initial expansion; a single sentence defining each on first use would improve accessibility.
  2. [Figure 1] The pipeline diagram (presumably Figure 1) would benefit from explicit labels on each stage indicating input/output artifacts and the role of the formal verifier.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for strengthening the manuscript's claims on representativeness, evaluation transparency, and scaling analysis. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and generation pipeline] Abstract and generation-pipeline description: the claim that the pipeline produces 'representative and plausible' misconfiguration scenarios is not accompanied by any quantitative comparison to real-world corpora (operator bug reports, NANOG archives, or configuration-change logs). Because the headline findings on regression rates and scaling behavior rest entirely on the 231 synthetic problems, this absence is load-bearing for the central claim.

    Authors: We acknowledge that the absence of a quantitative comparison to real-world corpora limits the strength of the 'representative and plausible' claim. Publicly available, structured corpora of network misconfigurations with verified ground-truth fixes are extremely limited due to the proprietary nature of operator data. Our pipeline draws from documented patterns in the network literature, including common errors discussed in NANOG archives and studies on configuration management. To address this, we will add a dedicated subsection to the generation pipeline description that provides a qualitative mapping of our 231 scenarios to categories of real-world issues (e.g., routing policy errors, ACL inconsistencies) cited in prior work. This addition will support the plausibility argument while preserving the benchmark's controlled and verifiable nature. revision: partial

  2. Referee: [Evaluation framework] Evaluation-framework description: while the manuscript states that formal verification assesses 'functional correctness of proposed fixes against ground-truth specifications,' the concrete properties checked (e.g., reachability invariants, protocol-specific constraints, ACL consistency) are not enumerated. Without these details it is impossible to judge whether the measured regression rates reflect genuine functional failures or merely the oracle's coverage.

    Authors: We agree that explicit enumeration of the verified properties is necessary for readers to evaluate the oracle's coverage and the validity of the regression measurements. The full manuscript (Section 3.3) details that the formal verification oracle checks a comprehensive set of invariants derived from the ground-truth specifications. These include reachability between designated host pairs, absence of blackholes or forwarding loops, protocol convergence and preference constraints (for BGP and OSPF), ACL consistency and policy enforcement, and overall absence of policy violations. We will revise the evaluation framework section to include a clear table that enumerates each property, its formal definition, and the verification method used. This change will make the assessment of functional correctness fully transparent. revision: yes

  3. Referee: [Results] Results section on scaling: the reported degradation 'at scale' is presented without an explicit definition of scale (node count, protocol-interaction complexity, or both) or statistical tests on the regression-introduction rates. This weakens the quantitative support for the recommendation of iterative verification workflows.

    Authors: Scale in the manuscript is defined primarily by topology size (node count ranging from 20 to 754 nodes), which directly correlates with increased protocol interactions and configuration interdependencies. Results are already broken down by size bins to show the degradation trend across all nine LLMs. While formal statistical tests were not included in the initial version, the consistent pattern of increased regressions with larger topologies provides clear support for the findings. We will update the results section with an explicit definition of scale, additional analysis (including correlation between node count and regression rate), and basic trend quantification to strengthen the quantitative basis for recommending iterative verification workflows. revision: partial
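
Several of the properties named in the second response (reachability, absence of blackholes, absence of forwarding loops) can be checked by walking a per-destination forwarding graph. The sketch below is a simplified illustration, not the paper's oracle: it assumes a single next hop per node (no multipath), and the function name `trace` and the router names are hypothetical.

```python
from typing import Dict, Optional


def trace(next_hop: Dict[str, Optional[str]], src: str, dst: str) -> str:
    """Follow next hops from src; report 'delivered', 'blackhole', or 'loop'."""
    seen = set()
    node = src
    while node != dst:
        if node in seen:
            return "loop"  # revisiting a node means traffic cycles forever
        seen.add(node)
        nxt = next_hop.get(node)
        if nxt is None:
            return "blackhole"  # no forwarding entry: traffic is dropped
        node = nxt
    return "delivered"


# Hypothetical forwarding graphs for one destination prefix hosted at r3.
healthy = {"r1": "r2", "r2": "r3"}
looping = {"r1": "r2", "r2": "r1"}
dropping = {"r1": "r2"}

results = [trace(g, "r1", "r3") for g in (healthy, looping, dropping)]
```

A reachability specification for the pair (r1, r3) would then hold only when the trace result is "delivered".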

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external oracle

The paper is a benchmarking study that defines a synthetic generation pipeline for misconfigurations, applies LLMs to repair them, and measures outcomes against an independent formal-verification oracle. No equations, derivations, or predictions are present that reduce to fitted inputs or self-referential definitions. Central claims rest on direct empirical measurements rather than any load-bearing self-citation chain or ansatz smuggling. The work is self-contained against its stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the generated scenarios are representative of real misconfigurations and that formal verification accurately captures functional correctness without false positives or negatives.

axioms (1)
  • domain assumption Synthesized misconfiguration scenarios are representative and plausible of real-world network errors.
    The generation pipeline is described as producing 'representative and plausible' scenarios, but the abstract provides no validation against operator traces or real failure data.

pith-pipeline@v0.9.0 · 5499 in / 1383 out tokens · 57001 ms · 2026-05-08T09:47:46.532716+00:00 · methodology
