
arxiv: 2605.22092 · v1 · pith:YGZD4MHJnew · submitted 2026-05-21 · 💻 cs.NI · cs.SE

Astragalus: Automatic Configuration Repair for Production Networks

Pith reviewed 2026-05-22 03:01 UTC · model grok-4.3

classification 💻 cs.NI cs.SE
keywords automatic configuration repair · network configurations · syntax-driven repair · production networks · localize-fix-validate · network errors · configuration incidents

The pith

A syntax-driven localize-fix-validate search repairs network configuration errors without modeling semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is trying to establish that automatic configuration repair for networks can be achieved effectively using a syntax-driven method that searches for fixes by localizing errors, proposing fixes from existing configuration code, and validating them, rather than building and solving expensive SMT constraints based on network semantics. This would matter if true because it could make repair tools practical and scalable for large production networks where semantic modeling is too complex and slow. A sympathetic reader cares because it promises quick fixes for errors that cause outages, as shown by high success rates on both synthetic and real networks. The approach adapts techniques from automatic program repair to network configs.

Core claim

Astragalus is a syntax-driven automatic configuration repair method that uses iterations of a localize-fix-validate pipeline to search for repairs. It repairs every incident in multiple sizes of a synthesized network and 97.5% of incidents on a real network with 15 types of errors injected, within an average time of 7.36 seconds. It has also provided valid repair options in under 6 minutes for 4 recent network incidents in a real production network with thousands of devices.

What carries the argument

The localize-fix-validate pipeline that searches for repair candidates by grafting existing configuration fragments without semantic analysis.
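
The iterative search can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the `repair` function, the `Config` model (device name mapped to configuration lines), and the three toy callables are hypothetical and are not taken from the paper. The sketch keeps any candidate that reduces the number of failing checks and then localizes again, which is one plausible reading of "multiple iterations", not a description of how Astragalus actually schedules its search.

```python
from typing import Callable, Iterable, Optional

Config = dict[str, list[str]]   # hypothetical model: device name -> config lines
Location = tuple[str, int]      # (device, line index)


def repair(
    config: Config,
    localize: Callable[[Config], list[Location]],
    propose: Callable[[Config, Location], Iterable[Config]],
    validate: Callable[[Config], int],
    max_iterations: int = 10,
) -> Optional[Config]:
    """Iterate localize -> fix -> validate; validate returns the number of failing checks."""
    best, best_failures = config, validate(config)
    for _ in range(max_iterations):
        if best_failures == 0:
            return best
        improved = False
        for location in localize(best):                # most suspicious first
            for candidate in propose(best, location):  # grafted variants
                failures = validate(candidate)
                if failures < best_failures:           # keep partial progress
                    best, best_failures = candidate, failures
                    improved = True
                    break
            if improved:
                break
        if not improved:
            return None                                # search space exhausted
    return best if best_failures == 0 else None


# Toy stand-ins, purely illustrative.
def toy_validate(cfg: Config) -> int:
    # One failing check per device that lacks the required permit line.
    return sum(1 for lines in cfg.values() if "permit 10.0.0.0/8" not in lines)


def toy_localize(cfg: Config) -> list[Location]:
    # No ranking here: every line is treated as equally suspicious.
    return [(dev, i) for dev, lines in cfg.items() for i in range(len(lines))]


def toy_propose(cfg: Config, location: Location) -> Iterable[Config]:
    # Replace the suspicious line with each other line found anywhere in the configs.
    dev, i = location
    donors = sorted({line for lines in cfg.values() for line in lines})
    for donor in donors:
        if donor != cfg[dev][i]:
            candidate = {d: list(lines) for d, lines in cfg.items()}
            candidate[dev][i] = donor
            yield candidate
```

With the toy callables, `repair({"S1": ["permit 10.0.0.0/8"], "S2": ["deny 10.0.0.0/8"]}, toy_localize, toy_propose, toy_validate)` replaces the `deny` line on `S2` with the `permit` line borrowed from `S1`.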

If this is right

  • Repairs can be found quickly for injected errors of 15 types.
  • The method works on real production networks with O(1,000) to O(10,000) devices.
  • Valid repairs are supplied for recent incidents in under six minutes.
  • It succeeds on every incident in synthesized networks of varying sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This syntax approach might apply to repairing configurations in other complex systems like distributed software.
  • Operators could use it to automate responses to config changes that lead to problems.
  • It opens the possibility of proactive scanning and repairing in config management pipelines.

Load-bearing premise

That a syntax-only search can locate usable repairs without modeling network forwarding semantics or device-specific behaviors.

What would settle it

A case where the pipeline fails to find a repair for an error that is known to be fixable by changing the configuration in a way not present in the existing code base.

Figures

Figures reproduced from arXiv: 2605.22092 by Peng Zhang, Xing Feng, Xu Liu, Zhenrong Gu.

Figure 1
Figure 1: The workflow of Astragalus. With no extra information attached, it does the localization by best effort: the configurations identified by the localizer are usually suspicious configurations rather than the root causes. After the localization, the fix generator receives the suspiciousness of each part of the configuration. Step 2: Fix generation (§4.3). Based on the suspiciousness provided by th…
Figure 2
Figure 2: A sample configuration snippet (a) and the corresponding AST (b). Most routing software and network simulators model the configuration as an abstract syntax tree [12]. Similarly, Astragalus internally represents the configuration files as an AST.
Figure 3
Figure 3: An example of localization and fix generation in ACR. The leftmost graph is an illustration of the topology. (a) The coverage report and suspiciousness calculation of each configuration unit in 𝑆1, 𝑆2 and 𝑆3, using Tarantula, and the repair options of each suspicious configuration unit. (b) The candidate repair options, and whether each repair option would fix the network. “Improved” means although the network…
Figure 4
Figure 4: The repair time of Astragalus over different sizes of fat-trees in the synthetic dataset, alongside AED and CEL. The x-axis denotes the fat-tree size 𝑘 (ranging from 4 to 12), while the y-axis reports repair/location time in seconds on a logarithmic scale. As evidenced in the figure, AED (orange bars) and CEL (yellow bars) exhibit severe scalability limitations. For 𝑘 = 4, both tools require approx…
Figure 5
Figure 5: The cumulative distribution function (CDF) of the relative root cause location 𝐿𝑟 for 4 SBFL techniques: Ochiai, Tarantula, Jaccard, and D-Star (with parameter 2). In both path change and reachability test cases, Ochiai, Jaccard and D-Star2 do not have an observable difference in the CDF. Panels: (a) 𝐿𝑟 of Path Change; (b) 𝐿𝑟 of Reac… Both axes run from 0.0 to 1.0; the vertical axis is the CDF of 𝐿𝑟.
Figure 6
Figure 6: The average time of localization, fix generation, and validation of Astragalus on the synthesized dataset. Solid bars are for reachability incidents, and striped bars are for path change incidents.
Figure 8
Figure 8: Two real-world network incidents studied. The arrows indicate the propagation of the routes. … from two data centers; and the peering switches only establish IBGP peering with the corresponding RR (𝑆3, 𝑆4 ↔ 𝑅𝑅1, 𝑆7, 𝑆8 ↔ 𝑅𝑅2). Many data centers in the network reuse ASNs in some layer of the network. As a result, the AS-path of a BGP route sent from one data center to another may contain two identical ASNs…
Figure 7
Figure 7: The proportion of three change operators applied in fixing fat-tree 8’s path change incidents.
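
The captions for Figures 3 and 5 name four spectrum-based fault localization (SBFL) scores: Tarantula, Ochiai, Jaccard, and D-Star with parameter 2. The sketch below writes out the standard textbook formulas for these scores and uses them to rank configuration units by suspiciousness. The coverage data, unit names, and the `rank` helper are hypothetical and only illustrate how a coverage report could feed the ranking; they are not the paper's data or code.

```python
import math

# ef / nf: failed tests that do / do not cover the unit
# ep / np_: passed tests that do / do not cover the unit


def tarantula(ef, nf, ep, np_):
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np_) if ep + np_ else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0


def ochiai(ef, nf, ep, np_):
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def jaccard(ef, nf, ep, np_):
    denom = ef + nf + ep
    return ef / denom if denom else 0.0


def dstar(ef, nf, ep, np_, star=2):
    denom = ep + nf
    if denom == 0:
        return math.inf if ef > 0 else 0.0
    return ef ** star / denom


def rank(coverage, failed, passed, metric=tarantula):
    """Sort configuration units from most to least suspicious."""
    scores = {}
    for unit, tests in coverage.items():
        ef = len(tests & failed)
        ep = len(tests & passed)
        scores[unit] = metric(ef, len(failed) - ef, ep, len(passed) - ep)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical coverage report: which tests exercise which configuration unit.
coverage = {
    "S1 bgp neighbor": {"t1", "t2", "t3"},
    "S2 route-map": {"t1"},
    "S3 acl": {"t2", "t3"},
}
ranking = rank(coverage, failed={"t1"}, passed={"t2", "t3"})
```

In this toy report the unit covered only by the failing test ranks first, the unit covered by every test lands in the middle, and the unit covered only by passing tests ranks last.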
Original abstract

Network configurations are prone to errors, which can lead to catastrophic service outages. A tool that can achieve automatic configuration repair (ACR) is highly desired by operators. Existing tools for ACR follow a semantic-driven approach: they model network semantics as a set of SMT constraints, and solve them for a location or fix of the error. Due to the complex semantics of networks, constructing and solving these constraints can be prohibitively expensive, making these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction, i.e., a syntax-driven approach, which tries to repair program bugs by "grafting" some existing code in the same repository, without modeling program semantics. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It uses multiple iterations of a "localize-fix-validate" pipeline to search for repairs, and proves quite effective on configurations of our production network. Specifically, we show that Astragalus can repair every incident in multiple sizes of a synthesized network, and 97.5% of the incidents on a real network, both with 15 types of errors injected, within an average time of 7.36 seconds. It has also provided valid repair options in under 6 minutes for 4 recent network incidents or undesired changes, in a real production network with O(1,000) to O(10,000) devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Astragalus, a syntax-driven automatic configuration repair (ACR) system for network configurations. Inspired by automatic program repair, it replaces semantic-driven SMT constraint solving with an iterative localize-fix-validate pipeline that searches for repairs by grafting existing configuration fragments from the same repository. The central empirical claims are that the system repairs every incident across multiple sizes of a synthesized network and 97.5% of incidents on a real production network (both with 15 injected error types) in an average of 7.36 seconds, and that it supplied valid repairs for four recent production incidents in under six minutes on a network with O(1,000)–O(10,000) devices.

Significance. If the reported success rates are reproducible and the validation step reliably distinguishes correct repairs, the work would be significant: it offers a lightweight, scalable alternative to existing SMT-based ACR tools that are often too expensive for large networks. The evaluation on both synthetic and real production incidents, including actual operator-reported cases, provides concrete evidence of practicality that could influence network management tooling.

major comments (2)
  1. [§3 and §4] §4 (Evaluation) and the localize-fix-validate description in §3: the 97.5% success rate on the real network and the 100% rate on synthesized networks rest on the validate step, yet the manuscript provides no concrete specification of the checks performed (e.g., whether validation uses only syntactic matching, incident-resolution heuristics, or any reachability/ACL tests). Because the design explicitly avoids SMT constraints on forwarding semantics, it is unclear whether a repair that passes validation can still alter unmodeled behaviors; this directly affects the trustworthiness of the headline repair percentages.
  2. [Abstract and §5] Abstract and §5 (Production incidents): the claim that valid repairs were supplied for four recent incidents in under six minutes is presented without describing the exact validation procedure used in the live setting or the number of candidate grafts considered per iteration. This information is load-bearing for the assertion that the syntax-only approach suffices for production use.
minor comments (2)
  1. [§3] The term 'grafting' is used repeatedly but never given a precise operational definition; a short paragraph or pseudocode box in §3 would clarify the operation.
  2. [§4] Table or figure captions in the evaluation section should explicitly state the number of runs, the exact error-injection methodology, and the success criterion used for each data point.
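
One way to make "grafting" operational, in the spirit of the first minor comment, is as copy-based edits on a configuration tree. The sketch below is an assumption throughout: the `Node` type, the path addressing by child index, and the three operations (insert, replace, delete) follow common practice in generate-and-validate program repair. This page does not name the three change operators Astragalus uses, so these should not be read as the paper's definition.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """One configuration statement plus its nested statements (hypothetical AST node)."""
    text: str
    children: list["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> list[str]:
    """Flatten the tree back into indented configuration lines."""
    lines = [" " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def _clone_and_find(root: Node, path: list[int]) -> tuple[Node, Node]:
    """Deep-copy the tree and return (new_root, node reached by following child indices)."""
    new_root = copy.deepcopy(root)
    node = new_root
    for index in path:
        node = node.children[index]
    return new_root, node


def graft_insert(root: Node, parent_path: list[int], donor: Node) -> Node:
    """Copy a donor subtree under the node at parent_path."""
    new_root, parent = _clone_and_find(root, parent_path)
    parent.children.append(copy.deepcopy(donor))
    return new_root


def graft_replace(root: Node, target_path: list[int], donor: Node) -> Node:
    """Replace the node at target_path with a copy of the donor subtree."""
    if not target_path:
        raise ValueError("cannot replace the root")
    new_root, parent = _clone_and_find(root, target_path[:-1])
    parent.children[target_path[-1]] = copy.deepcopy(donor)
    return new_root


def delete_node(root: Node, target_path: list[int]) -> Node:
    """Remove the node at target_path."""
    if not target_path:
        raise ValueError("cannot delete the root")
    new_root, parent = _clone_and_find(root, target_path[:-1])
    del parent.children[target_path[-1]]
    return new_root


# Illustrative use: borrow a statement that exists elsewhere in the repository.
broken = Node("router bgp 65001", [Node("neighbor 10.0.0.2 remote-as 65002")])
donor = Node("network 10.1.0.0/16")
candidate = graft_insert(broken, [], donor)
```

Each operation returns a fresh tree and leaves the original untouched, so many candidates can be generated from the same starting configuration and validated independently.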

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and have revised the manuscript to add the requested concrete details on validation while preserving the syntax-driven design.

  1. Referee: [§3 and §4] §4 (Evaluation) and the localize-fix-validate description in §3: the 97.5% success rate on the real network and the 100% rate on synthesized networks rest on the validate step, yet the manuscript provides no concrete specification of the checks performed (e.g., whether validation uses only syntactic matching, incident-resolution heuristics, or any reachability/ACL tests). Because the design explicitly avoids SMT constraints on forwarding semantics, it is unclear whether a repair that passes validation can still alter unmodeled behaviors; this directly affects the trustworthiness of the headline repair percentages.

    Authors: We agree that the original manuscript describes the validate step only at a high level. The validation consists of (1) re-parsing the modified configuration with the vendor parser to confirm syntactic well-formedness and (2) checking that the grafted fragment resolves the localized error according to the incident signature (e.g., presence of a required ACL entry or route). No full reachability or ACL semantic tests are performed, precisely because the approach deliberately avoids SMT. In the revised §3 we now enumerate these checks explicitly and add a short limitations paragraph acknowledging that unmodeled forwarding behaviors could in principle be affected; the empirical success rates therefore reflect syntactic-plus-incident-resolution validity rather than exhaustive semantic equivalence. revision: yes

  2. Referee: [Abstract and §5] Abstract and §5 (Production incidents): the claim that valid repairs were supplied for four recent incidents in under six minutes is presented without describing the exact validation procedure used in the live setting or the number of candidate grafts considered per iteration. This information is load-bearing for the assertion that the syntax-only approach suffices for production use.

    Authors: We accept the criticism. For the four production incidents the validation procedure was operator-driven: each top-ranked repair was presented to the responsible engineer, who confirmed that the change eliminated the reported symptom and introduced no new configuration errors visible in the running network. We have added this description to §5 together with the observed iteration statistics (average of 27 candidate grafts examined per incident). The abstract has been updated with a brief qualifier to the same effect. revision: yes
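
The two automated checks described in the first simulated response (re-parse the modified configuration, then check the incident signature) can be sketched as follows. This is an assumption-heavy illustration: the real vendor parser is not available here, so `toy_parse` is a stand-in that only recognizes a few leading keywords, and modeling an incident signature as sets of required and forbidden lines is a hypothetical choice.

```python
from typing import Callable, Iterable, Sequence


def validate(
    candidate_lines: Sequence[str],
    parse: Callable[[Sequence[str]], None],
    required: Iterable[str],
    forbidden: Iterable[str] = (),
) -> bool:
    """Return True if the candidate parses and matches the incident signature."""
    # Check 1: syntactic well-formedness, delegated to a parser callable
    # that raises ValueError on malformed input.
    try:
        parse(candidate_lines)
    except ValueError:
        return False
    # Check 2: the incident signature, modeled here as lines that must
    # (or must not) be present after the repair.
    present = {line.strip() for line in candidate_lines}
    if any(line not in present for line in required):
        return False
    return not any(line in present for line in forbidden)


KNOWN_KEYWORDS = {"ip", "permit", "deny", "router", "neighbor"}


def toy_parse(lines: Sequence[str]) -> None:
    """Stand-in for a vendor parser: rejects lines with an unknown leading keyword."""
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] not in KNOWN_KEYWORDS:
            raise ValueError(f"unknown statement: {line!r}")


candidate = ["ip access-list WEB", " permit tcp any any eq 443"]
ok = validate(candidate, toy_parse, required={"permit tcp any any eq 443"})
```

As the simulated response itself acknowledges, a check of this shape says nothing about forwarding behavior outside the signature, which is the concern raised in the first major comment.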

Circularity Check

0 steps flagged

Empirical localize-fix-validate pipeline exhibits no circularity


The paper presents Astragalus as a syntax-driven search procedure using iterated localize-fix-validate steps, explicitly avoiding SMT-based semantic modeling. No equations, fitted parameters, or derivation chains appear in the provided text; performance claims rest on direct experimental outcomes across synthesized networks (100% repair) and real incidents (97.5% plus four production cases). These results are externally falsifiable via the described test harness and do not reduce to self-definition or self-citation. The method is therefore self-contained against its benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the unstated premise that network configurations contain enough reusable syntactic fragments to enable repair by grafting, without any formal model of network behavior.

axioms (1)
  • domain assumption Network configurations contain reusable syntactic fragments sufficient for repair by grafting
    Implicit in the decision to follow a syntax-driven rather than semantic-driven method.
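
The axiom can be probed directly on a configuration repository: for a suspicious line, are there other statements of the same kind elsewhere that could serve as donors? The sketch below is a hypothetical illustration, not the paper's donor selection. It assumes line-oriented configurations and treats the first keyword of a statement as its "kind", which is a deliberate simplification.

```python
from collections import defaultdict


def statement_kind(line: str) -> str:
    """Use the first keyword as a crude notion of statement kind (an assumption)."""
    tokens = line.split()
    return tokens[0] if tokens else ""


def build_donor_index(configs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each statement kind to every distinct statement of that kind in the repository."""
    index: dict[str, set[str]] = defaultdict(set)
    for lines in configs.values():
        for line in lines:
            stripped = line.strip()
            if stripped:
                index[statement_kind(stripped)].add(stripped)
    return index


def donors_for(index: dict[str, set[str]], suspicious_line: str) -> list[str]:
    """Statements of the same kind as the suspicious line, excluding the line itself."""
    stripped = suspicious_line.strip()
    return sorted(index.get(statement_kind(stripped), set()) - {stripped})


# Hypothetical three-device repository.
configs = {
    "S1": ["router bgp 65001", " neighbor 10.0.0.2 remote-as 65002"],
    "S2": ["router bgp 65002", " neighbor 10.0.0.1 remote-as 65001"],
    "S3": ["router bgp 65003", " neighbor 10.0.0.2 remote-as 65009"],
}
index = build_donor_index(configs)
candidates = donors_for(index, " neighbor 10.0.0.2 remote-as 65009")
```

An empty donor list for a line that is known to be faulty marks the situation where this premise, and with it a purely graft-based repair, would not hold.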

pith-pipeline@v0.9.0 · 5783 in / 1146 out tokens · 40913 ms · 2026-05-22T03:01:02.699544+00:00 · methodology



Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

  1. [1]

    Tiramisu: Fast and General Network Verification

    Anubhavnidhi Abhashkumar, Aaron Gember-Jacobson, and Aditya Akella. Tiramisu: Fast and general network verification. arXiv preprint arXiv:1906.02043, 2019

  2. [2]

    Aed: Incrementally synthesizing policy-compliant and manageable configurations

    Anubhavnidhi Abhashkumar, Aaron Gember-Jacobson, and Aditya Akella. Aed: Incrementally synthesizing policy-compliant and manageable configurations. In Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies, pages 482–495, 2020

  3. [3]

    On the accuracy of spectrum-based fault localization

    Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. On the accuracy of spectrum-based fault localization. In Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICPART-MUTATION 2007), pages 89–98. IEEE, 2007

  4. [4]

    Fault-localization techniques for software systems: A literature review. ACM SIGSOFT Software Engineering Notes, 39(5):1–8, 2014

    Pragya Agarwal and Arun Prakash Agrawal. Fault-localization techniques for software systems: A literature review. ACM SIGSOFT Software Engineering Notes, 39(5):1–8, 2014

  5. [5]

    A scalable, commodity data center network architecture. ACM SIGCOMM computer communication review, 38(4):63–74, 2008

    Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. ACM SIGCOMM computer communication review, 38(4):63–74, 2008

  6. [6]

    The plastic surgery hypothesis

    Earl T Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 306–317, 2014

  7. [7]

    Clone detection using abstract syntax trees

    Ira D Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272), pages 368–377. IEEE, 1998

  8. [8]

    A general approach to network configuration verification

    Ryan Beckett, Aarti Gupta, Ratul Mahajan, and David Walker. A general approach to network configuration verification. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 155–168, 2017

  9. [9]

    Pinpoint: Problem determination in large, dynamic internet services

    Mike Y Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings International Conference on Dependable Systems and Networks, pages 595–604. IEEE, 2002

  10. [10]

    Facebook Engineering. Introducing data center fabric: The next-generation facebook data center network. https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/, 2014. [Online; accessed 06-December-2026]

  11. [11]

    Efficient network reachability analysis using a succinct control plane representation

    Seyed K Fayaz, Tushar Sharma, Ari Fogel, Ratul Mahajan, Todd Millstein, Vyas Sekar, and George Varghese. Efficient network reachability analysis using a succinct control plane representation. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 217–232, 2016

  12. [12]

    A general approach to network configuration analysis

    Ari Fogel, Stanley Fung, Luis Pedrosa, Meg Walraed-Sullivan, Ramesh Govindan, Ratul Mahajan, and Todd Millstein. A general approach to network configuration analysis. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 469–483, 2015

  13. [13]

    Automatic software repair: A survey

    Luca Gazzola, Daniela Micucci, and Leonardo Mariani. Automatic software repair: A survey. In Proceedings of the 40th International Conference on Software Engineering, pages 1219–1219, 2018

  14. [14]

    Automatically repairing network control planes using an abstract representation

    Aaron Gember-Jacobson, Aditya Akella, Ratul Mahajan, and Hongqiang Harry Liu. Automatically repairing network control planes using an abstract representation. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 359–373, 2017

  15. [15]

    Localizing router configuration errors using minimal correction sets. arXiv preprint arXiv:2204.10785, 2022

    Aaron Gember-Jacobson, Ruchit Shrestha, and Xiaolin Sun. Localizing router configuration errors using minimal correction sets. arXiv preprint arXiv:2204.10785, 2022

  16. [16]

    Fast control plane analysis using an abstract representation

    Aaron Gember-Jacobson, Raajay Viswanathan, Aditya Akella, and Ratul Mahajan. Fast control plane analysis using an abstract representation. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 300–313, 2016

  17. [17]

    Empirical evaluation of the tarantula automatic fault-localization technique

    James A Jones and Mary Jean Harrold. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, pages 273–282, 2005

  18. [18]

    Genprog: A generic method for automatic software repair

    Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. Genprog: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2011

  19. [19]

    R2fix: Automatically generating bug fixes from bug reports

    Chen Liu, Jinqiu Yang, Lin Tan, and Munawar Hafiz. R2fix: Automatically generating bug fixes from bug reports. In 2013 IEEE Sixth international conference on software testing, verification and validation, pages 282–291. IEEE, 2013

  20. [20]

    Automatic life cycle management of network configurations

    Hongqiang Harry Liu, Xin Wu, Wei Zhou, Weiguo Chen, Tao Wang, Hui Xu, Lei Zhou, Qing Ma, and Ming Zhang. Automatic life cycle management of network configurations. In Proceedings of the Afternoon Workshop on Self-Driving Networks, pages 29–35, 2018

  21. [21]

    Directfix: Looking for simple program repairs

    Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. Directfix: Looking for simple program repairs. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 448–458. IEEE, 2015

  22. [22]

    Automatic software repair: A bibliography. ACM Computing Surveys (CSUR), 51(1):1–24, 2018

    Martin Monperrus. Automatic software repair: A bibliography. ACM Computing Surveys (CSUR), 51(1):1–24, 2018

  23. [23]

    Semfix: Program repair via semantic analysis

    Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. Semfix: Program repair via semantic analysis. In 2013 35th International Conference on Software Engineering (ICSE), pages 772–781. IEEE, 2013

  24. [24]

    Acorn network control plane abstraction using route nondeterminism

    Divya Raghunathan, Ryan Beckett, Aarti Gupta, and David Walker. Acorn network control plane abstraction using route nondeterminism. In# PLACEHOLDER_PARENT_METADATA_V ALUE#, pages 261–272. TU Wien Academic Press, 2022

  25. [25]

    Jupiter rising: A decade of clos topologies and centralized control in google’s datacenter network. ACM SIGCOMM computer communication review, 45(4):183–197, 2015

    Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, et al. Jupiter rising: A decade of clos topologies and centralized control in google’s datacenter network. ACM SIGCOMM computer communication review, 45(4):183–197, 2015

  26. [26]

    Automated fixing of programs with contracts

    Yi Wei, Yu Pei, Carlo A Furia, Lucas S Silva, Stefan Buchholz, Bertrand Meyer, and Andreas Zeller. Automated fixing of programs with contracts. In Proceedings of the 19th international symposium on Software testing and analysis, pages 61–72, 2010

  27. [27]

    Leveraging program equivalence for adaptive program repair: Models and first results

    Westley Weimer, Zachary P Fry, and Stephanie Forrest. Leveraging program equivalence for adaptive program repair: Models and first results. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 356–366. IEEE, 2013

  28. [28]

    2021 facebook outage — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=2021_Facebook_outage&oldid=1221077563, 2024

    Wikipedia contributors. 2021 facebook outage — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=2021_Facebook_outage&oldid=1221077563, 2024. [Online; accessed 18-June-2024]

  29. [29]

    Software fault localization using dstar (d*)

    W Eric Wong, Vidroha Debroy, Yihao Li, and Ruizhi Gao. Software fault localization using dstar (d*). In 2012 IEEE Sixth International Conference on Software Security and Reliability, pages 21–30. IEEE, 2012

  30. [30]

    The plastic surgery hypothesis in the era of large language models

    Chunqiu Steven Xia, Yifeng Ding, and Lingming Zhang. The plastic surgery hypothesis in the era of large language models. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 522–534, 2023

  31. [31]

    Accuracy, scalability, coverage: A practical configuration verifier on a global wan

    Fangdan Ye, Da Yu, Ennan Zhai, Hongqiang Harry Liu, Bingchuan Tian, Qiaobo Ye, Chunsheng Wang, Xin Wu, Tianchen Guo, Cheng Jin, et al. Accuracy, scalability, coverage: A practical configuration verifier on a global wan. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, archite...

  32. [32]

    Automatic test packet generation

    Hongyi Zeng, Peyman Kazemian, George Varghese, and Nick McKeown. Automatic test packet generation. In Proceedings of the 8th international conference on Emerging networking experiments and technologies, pages 241–252, 2012

  33. [33]

    Differential network analysis

    Peng Zhang, Aaron Gember-Jacobson, Yueshang Zuo, Yuhao Huang, Xu Liu, and Hao Li. Differential network analysis. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 601–615, 2022

  34. [34]

    APKeep: Realtime verification for real networks

    Peng Zhang, Xu Liu, Hongkun Yang, Ning Kang, Zhengchang Gu, and Hao Li. APKeep: Realtime verification for real networks. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 241–255, 2020