Astragalus: Automatic Configuration Repair for Production Networks
Pith reviewed 2026-05-22 03:01 UTC · model grok-4.3
The pith
A syntax-driven localize-fix-validate search repairs network configuration errors without modeling semantics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Astragalus is a syntax-driven automatic configuration repair method that uses iterations of a localize-fix-validate pipeline to search for repairs. It repairs every incident in multiple sizes of a synthesized network and 97.5% of incidents on a real network with 15 types of errors injected, within an average time of 7.36 seconds. It has also provided valid repair options in under 6 minutes for 4 recent network incidents in a real production network with thousands of devices.
What carries the argument
The localize-fix-validate pipeline that searches for repair candidates by grafting existing configuration fragments without semantic analysis.
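The loop named above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `localize`, `candidate_grafts`, and `repair`, the one-statement-per-line config representation, and the toy scoring and validation callbacks are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional

Config = List[str]  # hypothetical representation: one config statement per line


def localize(config: Config, suspicious: Callable[[str], float]) -> List[int]:
    """Rank line indices from most to least suspicious (ties keep file order)."""
    return sorted(range(len(config)), key=lambda i: -suspicious(config[i]))


def candidate_grafts(config: Config, idx: int, donors: Iterable[str]) -> Iterable[Config]:
    """Yield variants of `config` with line `idx` replaced by a donor line, then deleted."""
    for donor in donors:
        if donor != config[idx]:
            yield config[:idx] + [donor] + config[idx + 1:]
    yield config[:idx] + config[idx + 1:]


def repair(
    config: Config,
    donors: List[str],
    suspicious: Callable[[str], float],
    validate: Callable[[Config], bool],
    max_locations: int = 10,
) -> Optional[Config]:
    """Try grafts at the most suspicious locations until one candidate validates."""
    for idx in localize(config, suspicious)[:max_locations]:
        for candidate in candidate_grafts(config, idx, donors):
            if validate(candidate):
                return candidate
    return None


# Toy usage: every value below is made up for illustration.
broken = ["interface eth0", "ip access-list edge", "deny tcp any any eq 443"]
donors = ["permit tcp any any eq 443", "permit tcp any any eq 80"]
fixed = repair(
    broken,
    donors,
    suspicious=lambda line: 1.0 if line.startswith("deny") else 0.0,
    validate=lambda cfg: "permit tcp any any eq 443" in cfg
    and not any(line.startswith("deny") for line in cfg),
)
```

The sketch never interprets what a statement does: it only ranks locations, swaps in fragments that already exist elsewhere, and asks a validator for a yes or no. That is the sense in which the search is syntax-driven.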
If this is right
- Repairs can be found quickly for injected errors of 15 types.
- The method works on real production networks with O(1,000) to O(10,000) devices.
- Valid repairs are supplied for recent incidents in under six minutes.
- It succeeds on every incident in synthesized networks of varying sizes.
Where Pith is reading between the lines
- This syntax-driven approach might also apply to repairing configurations in other complex systems, such as distributed software.
- Operators could use it to automate responses to config changes that lead to problems.
- It opens the possibility of proactive scanning and repairing in config management pipelines.
Load-bearing premise
That a syntax-only search can locate usable repairs without modeling network forwarding semantics or device-specific behaviors.
What would settle it
A case where the pipeline fails to find a repair for an error that is known to be fixable by changing the configuration in a way not present in the existing code base.
Original abstract
Network configurations are prone to errors, which can lead to catastrophic service outages. A tool that can achieve automatic configuration repair (ACR) is highly desired by operators. Existing tools for ACR follow a semantic-driven approach: they model network semantics as a set of SMT constraints, and solve them for a location or fix of the error. Due to the complex semantics of networks, constructing and solving these constraints can be prohibitively expensive, making these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction, i.e., a syntax-driven approach, which tries to repair program bugs by "grafting" some existing code in the same repository, without modeling program semantics. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It uses multiple iterations of a "localize-fix-validate" pipeline to search for repairs, and proves quite effective on configurations of our production network. Specifically, we show that Astragalus can repair every incident in multiple sizes of a synthesized network, and 97.5% of the incidents on a real network, both with 15 types of errors injected, within an average time of 7.36 seconds. It has also provided valid repair options in under 6 minutes for 4 recent network incidents or undesired changes, in a real production network with O(1,000)~O(10,000) devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Astragalus, a syntax-driven automatic configuration repair (ACR) system for network configurations. Inspired by automatic program repair, it replaces semantic-driven SMT constraint solving with an iterative localize-fix-validate pipeline that searches for repairs by grafting existing configuration fragments from the same repository. The central empirical claims are that the system repairs every incident across multiple sizes of a synthesized network and 97.5% of incidents on a real production network (both with 15 injected error types) in an average of 7.36 seconds, and that it supplied valid repairs for four recent production incidents in under six minutes on a network with O(1,000)–O(10,000) devices.
Significance. If the reported success rates are reproducible and the validation step reliably distinguishes correct repairs, the work would be significant: it offers a lightweight, scalable alternative to existing SMT-based ACR tools that are often too expensive for large networks. The evaluation on both synthetic and real production incidents, including actual operator-reported cases, provides concrete evidence of practicality that could influence network management tooling.
major comments (2)
- [§3 and §4] §4 (Evaluation) and the localize-fix-validate description in §3: the 97.5% success rate on the real network and the 100% rate on synthesized networks rest on the validate step, yet the manuscript provides no concrete specification of the checks performed (e.g., whether validation uses only syntactic matching, incident-resolution heuristics, or any reachability/ACL tests). Because the design explicitly avoids SMT constraints on forwarding semantics, it is unclear whether a repair that passes validation can still alter unmodeled behaviors; this directly affects the trustworthiness of the headline repair percentages.
- [Abstract and §5] Abstract and §5 (Production incidents): the claim that valid repairs were supplied for four recent incidents in under six minutes is presented without describing the exact validation procedure used in the live setting or the number of candidate grafts considered per iteration. This information is load-bearing for the assertion that the syntax-only approach suffices for production use.
minor comments (2)
- [§3] The term 'grafting' is used repeatedly but never given a precise operational definition; a short paragraph or pseudocode box in §3 would clarify the operation.
- [§4] Table or figure captions in the evaluation section should explicitly state the number of runs, the exact error-injection methodology, and the success criterion used for each data point.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and have revised the manuscript to add the requested concrete details on validation while preserving the syntax-driven design.
- Referee: [§3 and §4] §4 (Evaluation) and the localize-fix-validate description in §3: the 97.5% success rate on the real network and the 100% rate on synthesized networks rest on the validate step, yet the manuscript provides no concrete specification of the checks performed (e.g., whether validation uses only syntactic matching, incident-resolution heuristics, or any reachability/ACL tests). Because the design explicitly avoids SMT constraints on forwarding semantics, it is unclear whether a repair that passes validation can still alter unmodeled behaviors; this directly affects the trustworthiness of the headline repair percentages.
  Authors: We agree that the original manuscript describes the validate step only at a high level. The validation consists of (1) re-parsing the modified configuration with the vendor parser to confirm syntactic well-formedness and (2) checking that the grafted fragment resolves the localized error according to the incident signature (e.g., presence of a required ACL entry or route). No full reachability or ACL semantic tests are performed, precisely because the approach deliberately avoids SMT. In the revised §3 we now enumerate these checks explicitly and add a short limitations paragraph acknowledging that unmodeled forwarding behaviors could in principle be affected; the empirical success rates therefore reflect syntactic-plus-incident-resolution validity rather than exhaustive semantic equivalence.
  revision: yes
- Referee: [Abstract and §5] Abstract and §5 (Production incidents): the claim that valid repairs were supplied for four recent incidents in under six minutes is presented without describing the exact validation procedure used in the live setting or the number of candidate grafts considered per iteration. This information is load-bearing for the assertion that the syntax-only approach suffices for production use.
  Authors: We accept the criticism. For the four production incidents the validation procedure was operator-driven: each top-ranked repair was presented to the responsible engineer, who confirmed that the change eliminated the reported symptom and introduced no new configuration errors visible in the running network. We have added this description to §5 together with the observed iteration statistics (average of 27 candidate grafts examined per incident). The abstract has been updated with a brief qualifier to the same effect.
  revision: yes
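The two checks listed in the first response (re-parse the modified configuration, then test it against the incident signature) could look roughly like the sketch below. Everything here is an assumption made for illustration: `IncidentSignature`, `parses_cleanly`, and the keyword list are hypothetical, and the toy parser merely stands in for a real vendor parser.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class IncidentSignature:
    """Hypothetical signature: statements that must be present or absent after repair."""
    required: FrozenSet[str] = frozenset()
    forbidden: FrozenSet[str] = frozenset()


def parses_cleanly(config: List[str]) -> bool:
    """Toy stand-in for a vendor parser: reject blank lines and unknown leading keywords."""
    known = ("interface", "ip", "permit", "deny", "route-map", "router")
    return all(bool(line.strip()) and line.split()[0] in known for line in config)


def validate(config: List[str], sig: IncidentSignature) -> bool:
    """Check 1: syntactic well-formedness. Check 2: the incident signature is satisfied."""
    if not parses_cleanly(config):
        return False
    lines = set(config)
    return sig.required <= lines and not (sig.forbidden & lines)


# Toy usage with made-up statements.
sig = IncidentSignature(required=frozenset({"permit tcp any any eq 443"}))
ok = validate(["ip access-list edge", "permit tcp any any eq 443"], sig)
bad_syntax = validate(["ip access-list edge", "premit tcp any any eq 443"], sig)
unresolved = validate(["ip access-list edge", "permit tcp any any eq 80"], sig)
```

A validator of this shape accepts any candidate that parses and matches the signature, which is why the referee's concern about unmodeled forwarding behavior still applies: nothing in either check examines reachability.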
Circularity Check
Empirical localize-fix-validate pipeline exhibits no circularity
The paper presents Astragalus as a syntax-driven search procedure using iterated localize-fix-validate steps, explicitly avoiding SMT-based semantic modeling. No equations, fitted parameters, or derivation chains appear in the provided text; performance claims rest on direct experimental outcomes across synthesized networks (100% repair) and real incidents (97.5% plus four production cases). These results are externally falsifiable via the described test harness and do not reduce to self-definition or self-citation. The method is therefore self-contained against its benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Network configurations contain reusable syntactic fragments sufficient for repair by grafting
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Astragalus uses multiple iterations of a 'localize-fix-validate' pipeline... spectrum-based fault localization (SBFL) on configuration AST coverage"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "validation... off-the-shelf configuration verifiers... without modeling network semantics"
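The first quoted passage mentions spectrum-based fault localization (SBFL) over configuration coverage. As a hedged illustration of how an SBFL ranking works in general, the sketch below computes the Ochiai suspiciousness score, ef / sqrt((ef + nf) * (ef + ep)), for each covered element. The page does not say which SBFL formula the paper uses, so Ochiai is an assumption chosen for illustration, and the element names and test runs are made up.

```python
import math
from typing import Dict, List, Set, Tuple

Run = Tuple[Set[str], bool]  # (config elements covered by a test, did the test pass?)


def ochiai_scores(runs: List[Run]) -> Dict[str, float]:
    """Return an Ochiai suspiciousness score for every element covered by any run."""
    total_failed = sum(1 for _, passed in runs if not passed)  # equals ef + nf
    elements: Set[str] = set().union(*(cov for cov, _ in runs)) if runs else set()
    scores: Dict[str, float] = {}
    for element in elements:
        ef = sum(1 for cov, passed in runs if element in cov and not passed)
        ep = sum(1 for cov, passed in runs if element in cov and passed)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[element] = ef / denom if denom else 0.0
    return scores


# Toy usage: three checks over three hypothetical config elements.
runs = [
    ({"acl_edge", "bgp_peer"}, False),
    ({"bgp_peer", "static_route"}, True),
    ({"acl_edge"}, False),
]
scores = ochiai_scores(runs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Elements covered mostly by failing checks rise to the top of `ranked`, and those become the first locations where a repair search would try a graft.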
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.